Preface


A major aim of this book is to unify and extend latent variable modeling 
in the widest sense. The models covered include multilevel, longitudinal and 
structural equation models as well as relatives and friends such as generalized 
linear mixed models, random coefficient models, item response models, factor 
models, panel models, repeated measurement models, latent class models and 
frailty models. 

Latent variable models are used in most empirical disciplines, although often 
referred to by other names. In the spirit of the title of the ‘Interdisciplinary
Statistics Series’, we attempt to synthesize approaches from different disciplines
and to translate between the languages of statistics, biometrics, psychometrics
and econometrics (although we do not claim to have full command
of all these languages).

We strongly believe that progress is hampered by use of ‘local’ jargon leading
to compartmentalization. For instance, econometricians and biostatisticians
are rarely seen browsing each other’s journals. Even more surprising is
tribalism within disciplines, as reflected by lack of cross-referencing between 
item response theory and factor modeling in psychometrics (even within the 
same journal!). A detrimental effect of such lack of communication is a lack 
of awareness of useful developments in other areas until they are ‘translated’ 
and published in the ‘correct’ literature. For instance, models for drop-out 
(attrition in social science) have been prominent in econometrics for decades 
but have only quite recently been ‘discovered’ in the statistical literature. 

The book consists of two parts: methodology and applications. In Chapter
1 we discuss the concept, uses and interpretations of latent variables. In Chapter
2 we bring together models for different response types used in different
disciplines. After reviewing classical latent variable models in Chapter 3, we
unify and extend these models in Chapter 4 for all response types surveyed
in Chapter 2. Established and novel methods of model identification, estimation,
latent variable prediction and model diagnostics are extensively covered
in Chapters 5 to 8.

In the application Chapters 9 to 14 we use the methodology developed in
the first part to address problems from biology, medicine, psychology, education,
sociology, political science, economics, marketing and other areas. All
applications are based on real data, but our analysis is often simplified for
didactic reasons. We have used our Stata program gllamm, developed jointly
with Andrew Pickles, for all applications.

It is our hope that ample cross-referencing between the two parts of the 
book will allow readers to find illustrations of methodology in the application
chapters (by skipping forward) and statistical background for applications in 
the methodology chapters (by skipping back). 

The first three and a half chapters are intended as a relatively gentle introduction
to the modeling approach pervading this book. The remaining
methodological chapters are somewhat more technical, partly due to the generality
of the framework. However, we also consider simple special cases where
notation becomes less complex, ideas more transparent and results more intuitive.
Readers who are primarily interested in the interpretation and application
of latent variable models might want to skip most of Chapters 4 to 8
and concentrate on the application chapters.

The book is one of the outcomes of our collaboration over the last four years
on developing ‘Generalized Linear Latent And Mixed Models’ (GLLAMMs)
and the accompanying gllamm software. We acknowledge the input from our
collaborator Andrew Pickles. Anders would like to thank his ‘boss’ Per Magnus
and Sophia her former ‘boss’ Brian Everitt for encouragement and accepting
that this book took priority over other projects. Brian Everitt, Leonardo
Grilli, Carla Rampichini and two anonymous reviewers have read drafts of the
book and provided us with many helpful suggestions. We would also like to 
acknowledge constructive comments from Irit Aitkin, Bill Browne, Stephen 
Jenkins, Andrew Pickles, Sven Ove Samuelsen and Jeroen Vermunt. 

David Clayton, Per Kragh Andersen, Anthony Heath, Andrew Pickles and 
Bente Traeen have kindly provided data for our applications. We appreciate
that Patrick Heagerty, the BUGS project, Muthén & Muthén, Journal of
Applied Econometrics and the Royal Statistical Society have made data accessible
via the internet. We have also used data from the UK Data Archive
and the Norwegian Social Science Data Service with efficient help from Helene
Roshauw. Thanks are due to Jasmin Naim and Kirsty Stroud at Chapman
& Hall/CRC who have ensured steady progress through frequent but gentle
reminders. We also thank the developers of LaTeX for providing this invaluable
free tool for preparing manuscripts.

We have written each chapter together and contributed about equally to 
the book. Although writing this book has been hard work, we have had a lot 
of fun in the process! 


The gllamm software, documentation, etc., can be downloaded from:

http://www.gllamm.org

Datasets and scripts for some of the applications in this book are available at:

http://www.gllamm.org/books


Oslo and Berkeley 
January 2004 


Anders Skrondal 
Sophia Rabe-Hesketh 
CHAPTER 1


The omni-presence of latent variables 


1.1 Introduction 

Since this book is about latent variable models it is natural to begin with a 
discussion of the meaning of the concept ‘latent variable’. Depending on the 
context, latent variables have been defined in different ways, some of which will 
be briefly described in this chapter, although we generally find the definitions 
too narrow (see also Bollen, 2002). In this book we simply define a latent
variable as a random variable whose realizations are hidden from us. This is 
in contrast to manifest variables where the realizations are observed. 

Scepticism and prejudice regarding latent variable modeling are not uncommon
among statisticians. Latent variable modeling is often viewed as a dubious
exercise fraught with unverifiable assumptions and naive inferences regarding 
causality. Such a position can be rebutted on at least three counts: First, any 
reasonable statistical method can be abused by naive model specifications 
and over-enthusiastic interpretation. Second, ignoring latent variables often 
implies stronger assumptions than including them. Latent variable modeling 
can then be viewed as a sensitivity analysis of a simpler analysis excluding 
latent variables. Third, many of the assumptions in latent variable modeling 
can be empirically assessed and some can be relaxed, as we will see in later 
chapters. 

Latent variable modeling is furthermore often viewed as a rather obscure 
area of statistics, primarily confined to psychometrics. However, latent variables
pervade modern mainstream statistics and are widely used in different
disciplines such as medicine, economics, engineering, psychology, geography,
marketing and biology. This ‘omni-presence’ of latent variables is commonly
not recognized, perhaps because latent variables are given different names in 
different literatures, such as random effects, common factors and latent classes. 

In this chapter we will demonstrate that latent variables are used to represent
phenomena such as

• ‘True’ variables measured with error

• Hypothetical constructs 

• Unobserved heterogeneity 

• Missing data 

• Counterfactuals or ‘potential outcomes’ 

• Latent responses underlying categorical variables 
Latent variables are also used to 
the conventions of path diagrams, the circle represents the latent variable
$\eta_j$, the rectangles represent the observed measurements $y_{ij}$ and the arrows
represent linear relations (here with regression coefficients set to 1). The label
‘unit j’ implies that all variables inside the box vary between units and the
subscript j is hence omitted in the diagram.

Measurement models are usually specified with continuous latent variables
$\eta_j$. Such models are called factor models (see equation (1.1) on page 4) when
the observed measures are continuous and item response models when the 
measures are categorical. Factor and item response models are discussed in 
more detail in Section 3.3. 

Sometimes the true variable is instead construed as categorical, a typical 
example being medical diagnosis (ill versus not ill). The measurement is in this 
case usually also categorical with the same number of categories as the true 
variable. Measurement error then takes the form of misclassification. Measurement
models with categorical latent and measured variables are known as
latent class models and will be treated in Section 3.4. An application to the 
diagnosis of myocardial infarction (heart attack) is presented in Section 9.3. 

A basic assumption of measurement models, both for continuous and categorical
variables, is that the measurements are conditionally independent given
the latent variable, i.e. the dependence among the measurements is solely due
to their common association with the latent variable. This is reflected in the
path diagram in Figure 1.1 where there are no arrows directly connecting the
observed variables. This conditional or ‘local’ independence property is the
basis of the local independence definition of latent variables (e.g. Lord, 1953; 
Lazarsfeld, 1959). 

Measurement modeling can be used to assess measurement quality. If the 
true variable is continuous, measurement quality is typically assessed in terms 
of the reliability of individual measures; see Section 3.3. An application for 
fibre intake measurements is presented in Section 14.2, and other applications 
are given in Sections 10.3 and 10.4. If the true variable is categorical, measurement
quality is typically formulated in terms of the misclassification rate,
sensitivity and specificity. This is investigated in the context of diagnosis of 
myocardial infarction in Section 9.3 (see also Section 13.5). 

Measurement models can also be combined with regression models to avoid 
regression dilution when a covariate has been measured with error (e.g. Carroll
et al., 1995a; see Section 3.5). In Section 14.2 we discuss models for the effect
of dietary fibre intake on coronary heart disease when fibre intake is measured
with error, with replicate measurements available for a subgroup. Sometimes
a ‘gold standard’ measurement is available for a ‘validation sample’, whereas
the fallible measurement is available for the whole sample. In the validation
sample, the true value of the covariate is therefore observed, whereas it is
represented by a latent variable in the remainder of the sample, giving a model of
the same form as missing covariate models; see also Section 1.5. We will discuss
such a model for a case-control study of cervical cancer in Section 14.3. 
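
The regression dilution mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not an analysis from this book: the function names, the sample size, the unit variances and the classical measurement model with loading 1 are all choices made for illustration. The response depends on the true covariate, but the ‘naive’ regression uses the fallible measurement, so its slope is shrunk by approximately the reliability (true variance divided by total variance).

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_dilution(n=20000, beta=1.0, var_true=1.0, var_error=1.0, seed=1):
    """Hypothetical set-up: true covariate, fallible measurement, response."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, var_true ** 0.5) for _ in range(n)]
    # Classical measurement model: measurement = true value + independent error.
    measured_x = [t + rng.gauss(0, var_error ** 0.5) for t in true_x]
    # The response depends on the true covariate only (nondifferential error).
    y = [beta * t + rng.gauss(0, 1) for t in true_x]
    reliability = var_true / (var_true + var_error)
    return ols_slope(true_x, y), ols_slope(measured_x, y), reliability


slope_true, slope_naive, reliability = simulate_dilution()
print("slope using true covariate:     ", round(slope_true, 3))
print("slope using fallible measurement:", round(slope_naive, 3))
print("reliability (attenuation factor):", reliability)
```

With equal true and error variances the reliability is 0.5, so the naive slope should be close to half the true coefficient.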

Covariate measurement error models often make relatively strong assumptions
such as conditional independence of the measures given the true value,
nondifferential measurement error (that the measured covariate is conditionally
independent of the response variable given the true covariate), normally
distributed measurement errors and normally distributed true covariates. Many
of these assumptions can be assessed and/or relaxed. For example, we will 
relax the nondifferential measurement error assumption in Section 14.3. In 
Section 14.2, we use nonparametric maximum likelihood estimation to leave 
the distribution of the true covariate unspecified when replicate measurements 
are available. Furthermore, the ‘naive’ analysis ignoring measurement error is 
likely to (but need not!) produce greater biases than a misspecified model 
taking measurement error into account. The latter can be seen as a sensitivity 
analysis for the former. 

1.3 Hypothetical constructs 

In contrast to true variables measured with error which are presumed to exist 
(be ontological), hypothetical constructs have an exclusively epistemological 
status (e.g. Messick, 1981). Treating hypothetical constructs as real would thus 
entail a reification error. The scepticism regarding ‘latent variables’ among 
many statisticians can probably be attributed to the metaphysical status of 
hypothetical constructs. On the other hand, communication seems impossible 
without relying on hypothetical constructs. For instance, the concept of a 
‘good statistician’ is not real, but nevertheless useful and widely understood
among statisticians (although not easily defined). 

According to Cronbach (1971) a construct is an intellectual device by means 
of which one construes events. Thus, constructs are simply concepts. Relationships
between constructs provide inductive summaries of observed relationships
as a basis for elaborating networks of theoretical laws (e.g. Cronbach
and Meehl, 1955). Nunnally and Durham (1975, p.305) put it the following 
way: 

“… the words that scientists use to denote constructs, for example, ‘anxiety’ and
‘intelligence’, have no real counterpart in the world of observables; they are only
heuristic devices for exploring observables.”

Since hypothetical constructs do not correspond to real phenomena, it follows
that they cannot be measured directly even in principle (e.g. Torgerson,
1958; Goldberger, 1972). Instead, the construct is operationally defined in
terms of a number of items or indirect ‘indicators’ such as answers in an intelligence
test. The relationship between the latent construct and the observed
indicators is usually modeled using a common factor model (Spearman, 1904), 

$$y_{ij} = \lambda_i \eta_j + \epsilon_{ij}, \qquad (1.1)$$

where $\eta_j$ is the latent variable or ‘common factor’ representing the hypothetical
construct, $\lambda_i$ is a factor loading for item i and $\epsilon_{ij}$ is a unique factor,
representing specific aspects of item i and measurement error; see also Section
3.3.2. The factor model can be represented by the same path diagram
as the classical measurement model in Figure 1.1 where the paths from the
factor to the indicators could now be labeled with the factor loadings. 
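
As a rough illustration of the common factor model in equation (1.1), the following sketch simulates three items driven by a single common factor. The loadings, variances, sample size and function names are assumptions chosen for the example. It checks two properties: the covariance between two items is approximately the product of their loadings and the factor variance, and once the common factor is removed the unique factors are approximately uncorrelated (local independence).

```python
import random
import statistics


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_factor_model(loadings, factor_var=1.0, unique_var=0.5,
                          n=20000, seed=2):
    """Item i for unit j is loading_i * factor_j + unique factor_ij."""
    rng = random.Random(seed)
    factor = [rng.gauss(0, factor_var ** 0.5) for _ in range(n)]
    items = [[lam * f + rng.gauss(0, unique_var ** 0.5) for f in factor]
             for lam in loadings]
    return factor, items


loadings = [1.0, 0.8, 0.6]          # hypothetical factor loadings
factor, items = simulate_factor_model(loadings)

# Model-implied covariance of items 1 and 2: 1.0 * 0.8 * factor variance.
cov_items_12 = covariance(items[0], items[1])

# Remove the common factor; what remains are the unique factors.
unique = [[y - lam * f for y, f in zip(item, factor)]
          for lam, item in zip(loadings, items)]
cov_unique_12 = covariance(unique[0], unique[1])

print("covariance of items 1 and 2:         ", round(cov_items_12, 3))
print("covariance of their unique factors:  ", round(cov_unique_12, 3))
```

In practice the factor is of course not observed; it is used here only because the data are simulated.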

Hypothetical constructs are prominent in psychological research. In fact, 
it seems fair to say that most research in psychology and similar disciplines 
is concerned with hypothetical constructs such as ‘self-esteem’, ‘personality’
and ‘life-satisfaction’ (see Section 10.4). Sociologists are often concerned with
constructs such as ‘aspiration’ and ‘alienation’, whereas political scientists are
interested in ‘political efficacy’ (see Section 10.3). In education, researchers
are interested in constructs such as ‘arithmetic ability’ (see Section 9.4). It
should be noted that hypothetical constructs are also used in ‘harder’ sciences
such as economics to represent for instance ‘permanent income’ (e.g.
Goldberger, 1971) and ‘expectations’ (e.g. Griliches, 1974). Hypothetical constructs
are also important in medicine. Examples include ‘depression’ (e.g.
Dunn et al., 1993) and ‘quality of life’ (e.g. Fayers and Hand, 2002).

So far we have discussed continuous hypothetical constructs, which appears
to be the most common situation. However, it is sometimes more natural to
consider categorical constructs or typologies. In sociology a prominent example
is ‘social class’ (e.g. Marx, 1970). In psychology, ‘stages of change’ (precontemplation,
contemplation, preparation, action, maintenance and relapse)
are thought to be useful for assessing where patients are in their ‘journey’ to
change health behaviors such as trying to quit smoking (e.g. Prochaska and
DiClemente, 1983). In business, it is common practice to classify customers
into ‘market segments’, either for targeted marketing or for tailoring products.
For instance, Magidson and Vermunt (2002) used latent class models to
classify bank customers and found four segments: ‘value seekers’, ‘conservative
savers’, ‘mainstreamers’ and ‘investors’. We consider an application of market
segmentation for coffee makers in Section 13.6. In medicine, functional syndromes
such as irritable bowel syndrome, which are characterized by a set of
symptoms (whose cause is unknown), can be viewed as categorical hypothetical
constructs. Here the fact that certain symptoms have high probabilities of
occurring together is taken as an indication that they may be caused by the 
same disorder. 

Instead of defining hypothetical constructs on theoretical grounds, they are
sometimes ‘derived’ from an exploratory analysis, the classical example being
the use of exploratory factor analysis when the latent variables are construed
as continuous. The analogue for categorical latent variables is to use
exploratory latent class analysis to derive categorical constructs or typologies.
The danger of developing theory in this way has been vividly demonstrated by
Armstrong (1967). He used exploratory factor analysis in an example where
the underlying factors were known and the underlying model was simple and
provided a perfect fit to the data. While the exploratory factor analysis ‘explained’
a large proportion of the total variance, it failed spectacularly to recover the 
known factors. It is well worth citing Armstrong’s summary: 

“The cost of doing factor analytic studies has dropped substantially in recent
years. In contrast with earlier time, it is now much easier to perform the factor
analysis than to decide what to factor analyze. It is not clear that the resulting
proliferation of the literature will lead us to the development of better theories.

Factor analysis may provide a means of evaluating theory or of suggesting revisions
in theory. This requires, however, that the theory be explicitly specified
prior to the analysis of data. Otherwise, there will be insufficient criteria for
the evaluation of the results. If principal components is used for generating hypotheses
without an explicit a priori analysis, the world will soon be overrun by
hypotheses.”

Indeed, a perusal of contemporary psychology journals definitely suggests that 
his prophecy has been fulfilled; this part of the world has already been overrun 
by hypotheses! 

As a concrete example Armstrong mentioned a study by Cattell (1949)
who attempted to discover primary dimensions of culture. The 12 basic factors
obtained were rather mysterious, including gems such as ‘enlightened
affluence’, ‘thoughtful industriousness’ and ‘bourgeois philistinism’. Unfortunately,
questionable applications of this kind still abound in psychology. A
prominent recent example is the ruling ‘big-five theory’ in personality psychology
(e.g. Costa and McCrae, 1985), which ardently advocates that personality
is characterized by the five dimensions ‘extraversion’, ‘agreeableness’,
‘conscientiousness’, ‘neuroticism’ and ‘openness to experience’. This ‘theory’
has to a large extent been derived via exploratory factor analysis, but the
factors have nevertheless been given ontological status (interpreted as real).
Vassend and Skrondal (1995, 1997, 2004) are critical of the big-five theory 
and argue that the conventional analysis of personality instruments is fraught 
with conceptual and statistical problems. 

Although continuous hypothetical constructs are usually modeled by common
factors, this is not always the case. Several other multivariate statistical
methods have been used to explore the ‘dimensions’ underlying data. Examples
include principal component analysis (e.g. Jolliffe, 1986), partial least
squares (PLS) (e.g. Lohmöller, 1989), canonical correlations (e.g. Thompson,
1984), discriminant analysis (e.g. Klecka, 1980) and multidimensional scaling
(e.g. Kruskal and Wish, 1978). Categorical ‘constructs’ can be derived using
cluster analysis, finite mixture modeling or multidimensional scaling (e.g.
Shepard, 1974). However, we do not consider the ‘dimensions’ or ‘groupings’
produced by these methods as latent variables since they merely represent
transformations or geometric features of the data and not elements in a statistical
model. In fact, Bentler (1982) defines a latent variable as a variable
that cannot be expressed as a function of manifest variables only. Another
limitation is that the methods are usually strictly exploratory, not permitting
any hypothesized structure, based on research design, previous research or
substantive theory, to be imposed and tested. This problem is also shared by
exploratory factor and latent class analysis. Different types of statistical models
are contrasted in Section 8.2.1 and different modeling strategies discussed
in Section 8.2.2. 

Acknowledging that hypothetical constructs are useful in many disciplines, 
Several types of validity have been outlined in psychology, including ‘content’,
‘convergent’, ‘concurrent’, ‘discriminant’, ‘predictive’ and ‘nomological’
(American Psychological Association et al., 1974). Construct validity has become
the core of a hierarchical and unifying view of validity, integrating these
types (e.g. Silva, 1993). 

An advantage of latent variable modeling is that we can investigate the 
tenability of hypothesized structures, either by assessing model fit (see Section
8.5) or by elaborating the model in various ways. For instance, convergent
validity can be assessed by specifying models where indicators designed
to reflect a given construct only reflect that construct and not others. This is 
illustrated in the upper panel of Figure 1.2 where the first factor is measured 
by items 1 to 3 whereas the second factor is measured by items 4 to 6. If 
this model is rejected in favor of the model in the lower panel, where item 5 
loads on both factors, then convergent validity does not hold. Two alternative 
courses of action could be taken in this case, either accommodating the item 
as in the bottom panel or discarding it. The latter approach, common in item 
response modeling, could be criticized for being ‘self-fitting’ (e.g. Goldstein, 
1994) but may be justified if the theory is well founded. 

Discriminant validity may be investigated by inspecting the uniqueness of 
the constructs, in the sense that the estimated correlations among constructs 
should not be too large. In the lower panel of the figure, the estimated covariance
$\hat{\psi}_{12}$ is large and the discriminant validity is questionable.

Figure 1.3 Nomological validity (left panel: ‘Valid’ (hypothesized); right panel: ‘Invalid’)

© 2004 by Chapman & Hall/CRC

Nomological validity is typically assessed by investigating the tenability of
the structural model that incorporates the theory-induced relationships among
the constructs. For example, consider the model of ‘complete mediation’ (e.g.
Baron and Kenny, 1986) represented in the left panel of Figure 1.3, positing
that there is no direct effect of $\eta_1$ (e.g. ‘anomalous parental bonding’) on
$\eta_3$ (e.g. ‘depression’), but only an indirect effect via the mediator $\eta_2$ (e.g.
‘personality’). If this model is rejected in favor of a model with both direct
and indirect effects of $\eta_1$ on $\eta_3$, as shown in the right panel, the complete
mediation model does not have ‘nomological’ validity.

We refer to Bagozzi (1981) for an application examining the construct validity
of the expectancy-value and semantic-differential models of attitude.


1.4 Unobserved heterogeneity 

A major aim of statistical modeling is to ‘explain’ the variability in the response
variable in terms of variability in observed covariates, sometimes called
‘observed heterogeneity’. However, in practice, not all relevant covariates are 
observed, leading to unobserved heterogeneity. Including latent variables, in 
this context typically referred to as random effects, in statistical models is a 
common way of taking unobserved heterogeneity into account. 

Random effects models are widely used for a variety of problems. Examples
include longitudinal analysis (e.g. Laird and Ware, 1982), meta-analysis
(e.g. DerSimonian and Laird, 1986), capture-recapture studies (e.g. Coull and
Agresti, 1999), conjoint analysis (e.g. Green and Srinivasan, 1990), biometrical
genetics (e.g. Neale and Cardon, 1992) and disease mapping (e.g. Clayton and
Kaldor, 1987). Applications of models for unobserved heterogeneity are given
in Sections 9.2 and 11.3 for longitudinal studies, Section 9.5 for meta-analysis,
Section 9.7 for capture-recapture studies, Section 11.4 for small area estimation
and disease mapping, and Section 13.6 for conjoint analysis in marketing.

Note that unobserved heterogeneity is not a hypothetical construct since 
it merely represents the combined effect of all unobserved covariates, and is 
not given any meaning beyond this. The random effects in genetic studies 
perhaps occupy an intermediate position, since they are interpreted as shared 
and unshared genetic and environmental influences. 

When the units are clustered, shared unobserved heterogeneity may induce
‘intra-cluster’ dependence among the responses, even after conditioning on
observed covariates. This is illustrated in Figure 1.4 for ten clusters with two
units each and no covariates. Here, heterogeneity is reflected in the scatter of
the cluster means (shown as horizontal bars) around the overall mean (horizontal
line), leading to within-cluster correlations because both responses for
a cluster tend to lie on the same side of the overall mean. This phenomenon
is common for longitudinal or panel data, where observations for the same
unit are influenced by the same (shared) unit-specific unobserved heterogeneity.
An example involving repeated measurements of respiratory infection in
Indonesian children is given in Section 9.2. Other examples of clustered data 
include individuals in households, or children in schools. 
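
The link between shared heterogeneity and within-cluster correlation can be sketched by simulation. The set-up below is an assumption made for illustration (many clusters of two units, normal random intercept and residual, hypothetical function names), not an example from this book. Each response is an overall mean plus a cluster-specific deviation plus a unit-specific residual, and the correlation between the two responses in a cluster should be close to the share of the total variance that is due to the cluster-level deviation.

```python
import math
import random
import statistics


def correlation(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def simulate_pairs(n_clusters=20000, between_var=1.0, within_var=1.0,
                   overall_mean=5.0, seed=3):
    """Two responses per cluster sharing one random intercept."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_clusters):
        shared = rng.gauss(0, between_var ** 0.5)   # cluster-level deviation
        first.append(overall_mean + shared + rng.gauss(0, within_var ** 0.5))
        second.append(overall_mean + shared + rng.gauss(0, within_var ** 0.5))
    return first, second


first, second = simulate_pairs()
within_cluster_corr = correlation(first, second)
expected_corr = 1.0 / (1.0 + 1.0)   # between variance / total variance
print("simulated within-cluster correlation:", round(within_cluster_corr, 3))
print("between variance / total variance:   ", expected_corr)
```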
Figure 1.4 Between-cluster heterogeneity and within-cluster correlation


A different kind of clustered data are counts, such as number of epileptic fits 
for a person over a week. Although there is a single count for each unit, each 
count comprises several events whose occurrence is likely to be influenced by 
shared unit-specific covariates. The resulting variability in ‘proneness’ to experiencing
the event can lead to ‘overdispersion’, in a Poisson model meaning
that the variance is larger than the mean (see Sections 2.3.1 and 11.2). These
two consequences of unobserved heterogeneity, within-cluster dependence and
overdispersion, lead to incorrect inferences if not properly accounted for. One
way of accounting for unobserved heterogeneity is to include a random intercept
in a regression model. In the case of clustered data, units in the same
cluster must share the same value or realization of the random effects. In Figure
1.4, the random intercept would represent the deviations of the cluster
means (horizontal bars) from the overall mean. 
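
Overdispersion due to variability in ‘proneness’ can also be sketched by simulation. In the hypothetical set-up below, one set of counts is Poisson with a common mean, while in the other each unit's mean is multiplied by a gamma-distributed factor with mean 1. The gamma distribution, the parameter values and the function names are assumptions made for illustration only. The variance-to-mean ratio should be near 1 in the first case and clearly above 1 in the second.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """One Poisson(lam) variate by Knuth's multiplication method (small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def dispersion_ratio(counts):
    """Sample variance divided by sample mean (1 for a Poisson distribution)."""
    return statistics.variance(counts) / statistics.fmean(counts)


def simulate_counts(n=20000, mean=3.0, proneness_var=0.5, seed=4):
    rng = random.Random(seed)
    # No heterogeneity: every unit has the same Poisson mean.
    homogeneous = [poisson_draw(rng, mean) for _ in range(n)]
    # Heterogeneity: unit-specific multiplier with mean 1, variance proneness_var.
    shape = 1.0 / proneness_var
    heterogeneous = [
        poisson_draw(rng, mean * rng.gammavariate(shape, proneness_var))
        for _ in range(n)
    ]
    return homogeneous, heterogeneous


homogeneous, heterogeneous = simulate_counts()
ratio_homogeneous = dispersion_ratio(homogeneous)
ratio_heterogeneous = dispersion_ratio(heterogeneous)
print("variance/mean without heterogeneity:", round(ratio_homogeneous, 2))
print("variance/mean with heterogeneity:   ", round(ratio_heterogeneous, 2))
```

Under these assumptions the marginal variance is the mean plus the squared mean times the variance of the multiplier, so the second ratio should be roughly 1 + 3 × 0.5 = 2.5.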

In multilevel or hierarchical data there are often several levels of clustering,
an example being panel data on individuals in households. We can then use
latent variables at each of the higher levels to represent unobserved heterogeneity
at that level. A simple three-level random intercept regression model
for panel wave i, individual j and household k can be written as

$$y_{ijk} = \eta_{0jk} + \beta_1 x_{ijk} + \epsilon_{ijk},$$

$$\eta_{0jk} = \gamma_{00} + \zeta^{(2)}_{jk} + \zeta^{(3)}_{k}.$$

In the level-1 model for $y_{ijk}$, $x_{ijk}$ is a covariate with ‘fixed’ regression coefficient
$\beta_1$ and $\eta_{0jk}$ is a random intercept with mean $\gamma_{00}$ and residuals $\zeta^{(2)}_{jk}$ and
$\zeta^{(3)}_{k}$ at levels 2 and 3, respectively. The random part of this model (not showing
$\beta_1 x_{ijk}$ and $\gamma_{00}$) is presented in path diagram form in Figure 1.5 for a household
with three individuals participating at 3, 1 and 2 occasions, respectively.
Multilevel models are discussed in greater detail in Section 3.2. 



Figure 1.5 Path diagram of three-level random intercept model 

The effect of a covariate on the response can also differ between clusters 
which can be modeled by including cluster-specific ‘random coefficients’. For
example, the change in epileptic seizure rate over time may vary across subjects
as investigated in Section 11.3. Analogously, the effect of political distance
(between party and voter) on party preference may vary across constituencies
as discussed in Section 13.4. Sometimes the unobserved heterogeneity
is discrete with units falling into distinct clusters (see Section 12.4 for
an application). 

An important consequence of unobserved heterogeneity is that relationships 
between the response and the observed covariates are usually different at the 
unit (or cluster) and population levels. A prominent example is frailty in 
survival analysis (e.g. Aalen, 1988), where the population-level hazard can 
differ drastically from the unit-level hazards due to unexplained variability in 
the latter. The reason for this is that some individuals are more ‘frail’ than
others, being more susceptible to the event than can be explained by their
observed covariates. These individuals will tend to experience the event early 
on, leaving behind the less frail. Consequently, even if individual hazards are 
constant, the population average hazard will decline over time. We consider 
multivariate frailty models for the treatment of angina in Section 12.4. 
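
The selection effect described above can be made concrete with a deliberately simple calculation. As an assumption for illustration only, suppose the population consists of two equally large groups, each with a constant hazard (0.5 for the less frail, 2.0 for the more frail); this two-group mixture and the function name are hypothetical and are not the frailty models used later in the book. The population hazard at time t is the average of the group hazards weighted by the proportion of each group still at risk.

```python
import math


def population_hazard(t, hazards, weights):
    """Hazard among survivors at time t for a mixture of constant hazards."""
    # Each group's share of the initial population that is still event-free.
    surviving = [w * math.exp(-h * t) for h, w in zip(hazards, weights)]
    # Event rate contributed by each group at time t.
    event_rate = [h * s for h, s in zip(hazards, surviving)]
    return sum(event_rate) / sum(surviving)


hazards = [0.5, 2.0]    # hypothetical unit-level hazards, constant over time
weights = [0.5, 0.5]    # hypothetical group proportions at time 0

times = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
population = [population_hazard(t, hazards, weights) for t in times]
for t, h in zip(times, population):
    print(f"t = {t:4.1f}  population hazard = {h:.3f}")
```

Although every individual hazard is constant, the population hazard starts at the average of the two hazards and declines towards the hazard of the less frail group as the more frail individuals are removed.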

The distinction between effects at the unit (or cluster) and population levels 
is also important for dichotomous responses. If the unit-specific models are
probit regressions with different intercepts (due to unobserved heterogeneity) 
but sharing a common coefficient for a single covariate, then the population 
averaged model is a probit regression with an attenuated coefficient. This 
is illustrated in Figure 1.6 where the population averaged curve, shown in 
bold, has a smaller ‘slope’ than the unit-specific curves (see Section 4.8.1
for a derivation). Whether unit-specific or population-averaged effects are of 
interest will depend on the context. For example, population averaged effects 
are often of concern in public health where the focus is on the population level. 
In a clinical setting, on the other hand, patient-specific effects are obviously 
more important for the individual patient and her physician. Importantly, 
since causal processes necessarily operate at the unit and not the population 
level, it follows that investigation of causality requires unit-specific effects. 
If there are repeated observations on the units, unit-specific effects can be
estimated by including random effects in the models.

Figure 1.6 Unit-specific versus population-average probit regression
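
The attenuation of the population averaged probit coefficient can be checked numerically. The sketch below assumes a normally distributed random intercept with variance psi added to the linear predictor of a unit-specific probit model; the parameter values and function names are hypothetical. Averaging the unit-specific probabilities over the random intercept by a simple grid sum should reproduce a probit curve whose intercept and slope are divided by the square root of 1 + psi.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def marginal_prob_numeric(x, beta0, beta, psi, n_grid=4001, width=8.0):
    """Average the unit-specific probit probability over the random intercept."""
    sd = math.sqrt(psi)
    intercept_dist = NormalDist(0.0, sd)
    step = 2.0 * width * sd / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        zeta = -width * sd + k * step          # grid point for the intercept
        unit_prob = STANDARD_NORMAL.cdf(beta0 + beta * x + zeta)
        total += unit_prob * intercept_dist.pdf(zeta) * step
    return total


def marginal_prob_closed_form(x, beta0, beta, psi):
    """Probit curve with coefficients attenuated by sqrt(1 + psi)."""
    return STANDARD_NORMAL.cdf((beta0 + beta * x) / math.sqrt(1.0 + psi))


beta0, beta, psi = -0.5, 1.0, 1.0      # hypothetical values
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
numeric = [marginal_prob_numeric(x, beta0, beta, psi) for x in xs]
closed = [marginal_prob_closed_form(x, beta0, beta, psi) for x in xs]
max_difference = max(abs(a - b) for a, b in zip(numeric, closed))
print("largest difference between the two curves:", max_difference)
```

With psi = 1 the population averaged slope is the unit-specific slope divided by the square root of 2, which is why the bold curve in a plot of this kind is flatter than the unit-specific curves.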

In the preceding examples, we have considered units observed on multiple 
occasions and we will now return to the more general case of units (possibly 
occasions) nested in clusters (possibly units). It is important to note that relationships
between covariates and response variables may be different at the
unit and cluster level. Inferences regarding effects at the unit level based on
aggregated data at the cluster level may therefore lead to the so-called ‘ecological
fallacy’ (e.g. Robinson, 1950). Robinson’s classical example concerned
the correlation between the percentage of black people and illiteracy at the
region level, estimated as 0.95, which was very different from the estimated
individual-level correlation between being black and individual illiteracy, estimated
as 0.20.

Figure 1.7 illustrates that within-cluster effects can be very different from 
between-cluster effects, possibly having opposite directions. The clusters in the 
figure could represent countries, the response y could be a health outcome such 
as length of life and the explanatory variable x could be exposure to unhealthy 
luxuries such as red meat. Within a country, increasing exposure is associated 
with decreasing health (as reflected by the downward slopes of the dotted 
lines). Between countries, on the other hand, increasing average exposure is 
associated with increasing average health (as reflected by the upward slope 
of the dashed line) since it is also associated with increasing average living 
standards. 

Figure 1.7 Within-cluster and between-cluster effects 
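
The contrast between within-cluster and between-cluster effects can be made concrete with a small constructed data set. The sketch below uses hypothetical, noise-free data in which the within-cluster slope is -1 and the between-cluster slope is 3; the helper names are illustrative only.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def within_between_slopes(clusters):
    """Return (within, between) slopes for a list of (xs, ys) clusters."""
    mean_x = [sum(xs) / len(xs) for xs, _ in clusters]
    mean_y = [sum(ys) / len(ys) for _, ys in clusters]
    # Between-cluster effect: regress cluster means of y on cluster means of x.
    between = ols_slope(mean_x, mean_y)
    # Within-cluster effect: regress deviations from the cluster means.
    dev_x = [x - mx for (xs, _), mx in zip(clusters, mean_x) for x in xs]
    dev_y = [y - my for (_, ys), my in zip(clusters, mean_y) for y in ys]
    within = ols_slope(dev_x, dev_y)
    return within, between

# Hypothetical clusters (e.g. countries): mean exposure 0, 2, 4, 6, 8.
clusters = []
for j in range(5):
    m = 2.0 * j
    xs = [m - 1.0, m, m + 1.0]
    ys = [1.0 + 3.0 * m - 1.0 * (x - m) for x in xs]
    clusters.append((xs, ys))

within, between = within_between_slopes(clusters)
pooled = ols_slope([x for xs, _ in clusters for x in xs],
                   [y for _, ys in clusters for y in ys])
```

A regression that pools all observations and ignores the clustering gives a slope between the two, which is neither the within-cluster nor the between-cluster effect.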

Another important example would be longitudinal data where the within-unit 
decline could represent an age effect, whereas the between-unit increase 
could be a cohort effect. If only cross-sectional data are available, we cannot 
distinguish between age and cohort effects. For example, greater conservatism 
in older people as compared to younger people could be due to being at a later 
stage in their lives or due to being born into a different epoch. More formally, 
consider the longitudinal model 

$$y_{ij} = \beta_0 + \beta_C x_{1j} + \beta_L (x_{ij} - x_{1j}) + \zeta_j + \epsilon_{ij},$$

where $y_{ij}$ is a response for unit $j$ at occasion $i$, $x_{ij}$ could be age, and $\zeta_j$ is 
a subject-specific random intercept. The longitudinal design allows separate 
estimation of the cross-sectional (or cohort) effect $\beta_C$ and the longitudinal 
effect $\beta_L$. This distinction is closely related to the problem of correlation 
between random effects and covariates discussed on page 52. 

In random effects modeling it is typically assumed that the random effects 
have multivariate normal distributions. Model diagnostics (see Section 8.6) 
can be used to assess this assumption, although they may not be sufficiently 
sensitive. Fortunately, inferences are in many cases quite robust to 
misspecifications of the random effects distribution (e.g. Bartholomew, 1988, 1994). 
We can moreover relax the distributional assumption by using nonparametric 
maximum likelihood estimation (see Section 6.5). This approach is used in 
modeling faulty teeth in children in Section 11.2, epileptic seizures in 
Section 11.3 and diet and heart disease in Section 14.2. 


1.5 Missing values and counterfactuals 

Latent variables can represent missing values of partially observed variables, 
that is, variables that are observed on a subset of the units. Usually, the missing 
values are presumed to have been ‘realized’ but for some reason not recorded. 
However, missing values are sometimes values that would have been realized 
under ‘counterfactual’ circumstances, for instance if a covariate had had a 
different value. 

If a covariate is missing for some units, these units typically cannot 
contribute to parameter estimation. This loss of units leads to reduced efficiency 
which can be overcome by filling in missing covariate values using (multiple) 
imputation (e.g. Rubin, 1987; Schafer, 1997). The units can then contribute 
information on the relationship between the responses and the other covariates. 
Instead of imputing covariate values, we could jointly estimate the imputation 
model with the model of interest, integrating the likelihood over the 
imputation distribution for the missing values. Here the missing values can be 
represented by a latent variable assumed to have the same distribution as the 
observed values and the same relationship to the responses (e.g. regression 
parameter) in the model of interest. An example of a missing covariate problem 
is covariate measurement error when there is a validation sample in which the 
true covariate is observed (see Section 14.3). Another example is estimation 
of ‘complier average causal effects’ (Imbens and Rubin, 1997b) in randomized 
interventions with noncompliance where compliance is not observed in the 
control group (see Section 14.4). Here compliance status in the control group 
can also be viewed as ‘counterfactual’. 

If responses are missing for some units, the units can contribute to 
parameter estimation as long as there is at least one observed response, leading to 
consistent parameter estimates if the data are missing at random (MAR) (e.g. 
Rubin, 1976; Little and Rubin, 2002). However, if the responses are not 
missing at random (NMAR), ignoring the missing data mechanism can lead to 
biased parameter estimates. This can be addressed by joint modeling of the 
substantive and missingness processes. In the ‘selection model’ approach (e.g. 
Little, 1995), the dependence of the missingness process on the unobserved 
response is explicitly modeled. 

For example, in a longitudinal setting where responses are missing due to 
dropout or attrition, Hausman and Wise (1979) introduced a model in 
econometrics that was later rediscovered in the statistical literature by Diggle and 
Kenward (1994). Here the dropout at each time-point (given that it has not 
yet occurred) is modeled using a logistic (or probit) regression model with pre- 


1.6 Latent responses 


Latent variables can represent continuous variables underlying observed 
‘coarsened’ responses such as dichotomous or ordinal responses (see Section 2.4). 
The latent response interpretation of dichotomous responses was introduced 
by Pearson (1901) for normally distributed latent responses. Although most 
commonly used for such probit models, the latent response formulation is just 
as applicable for logit and complementary log-log models. In the dichotomous 
case, the observed response $y_i$ is modeled as resulting from combining a 
regression model for an underlying continuous response $y_i^*$ 

$$y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i$$

with a threshold model 

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise.} \end{cases}$$

This model corresponds to a probit model if $\epsilon_i$ is standard normal, a logit 
model if $\epsilon_i$ has a logistic distribution and a complementary log-log model if $\epsilon_i$ 
has a Gumbel distribution. A general interpretation of a latent response is the 
‘propensity’ to have a positive observed response $y_i = 1$. In genetics, the latent 
response is interpreted as the ‘liability’ to develop a qualitative trait or 
phenotype such as diabetes type I (e.g. Falconer, 1981). Heckman (1978) modeled 
whether or not American states had introduced fair-employment legislation 
and described the corresponding latent response as the ‘sentiment’ favoring 
fair-employment legislation. In toxicology, a unit’s ‘tolerance’ to a drug is the 
maximum dose the unit can tolerate, so that exceeding the dose results in 
death (e.g. Finney, 1971). 
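
The correspondence between the latent response formulation and the probit and logit models can be illustrated by simulation. The sketch below assumes a threshold at zero for the latent response and a fixed, hypothetical value of the linear part; it draws errors, dichotomizes the latent response and compares the share of positive responses with the model probability. All names are illustrative.

```python
import math
import random

def simulate_latent_response(nu, draw_error, n=200_000, seed=1):
    """Share of y = 1 when y* = nu + error and y = 1 if y* > 0."""
    rng = random.Random(seed)
    positives = sum(1 for _ in range(n) if nu + draw_error(rng) > 0.0)
    return positives / n

def draw_normal(rng):
    """Standard normal error (probit model)."""
    return rng.gauss(0.0, 1.0)

def draw_logistic(rng):
    """Standard logistic error (logit model), by inverting the CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

def probit_prob(nu):
    return 0.5 * (1.0 + math.erf(nu / math.sqrt(2.0)))

def logit_prob(nu):
    return 1.0 / (1.0 + math.exp(-nu))

nu = 0.5  # hypothetical value of the linear part
share_probit = simulate_latent_response(nu, draw_normal)
share_logit = simulate_latent_response(nu, draw_logistic)
```

The simulated shares should lie close to probit_prob(nu) and logit_prob(nu), respectively, with the discrepancy shrinking as n grows.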

In the decision context where individuals choose the most preferred 
alternative or rank alternatives in order of preference, the latent responses can 
be interpreted as utility differences. For example, consider the scenario that 
commuters must choose between car or bus. The utilities $u_i^{\mathrm{car}}$ and $u_i^{\mathrm{bus}}$ for 
car and bus may depend on the respective travelling times for the commuter 
and other covariates. The commuter decides to travel by car if $u_i^{\mathrm{car}} > u_i^{\mathrm{bus}}$ or 
alternatively, if $y_i^* = u_i^{\mathrm{car}} - u_i^{\mathrm{bus}} > 0$. Models for such comparative responses 
and several applications are presented in Chapter 13. 

Not surprisingly, the introduction of latent responses attracted criticism. 
For instance, Yule (1912, p.611-612) remarked: 

“...all those who have died of smallpox are equally dead: no one is more dead or 
less dead than another, and the dead are quite distinct from the survivors.” 

with the response by Pearson and Heron (1913, p.159): 

“...if Mr Yule’s views are accepted, irreparable damage will be done to the growth 
of modern statistical theory.” 

Leaving this philosophical (and personal?) debate aside, the latent response 
formulation of probit models is undoubtedly useful regardless of whether the 
latent response can be given a real meaning. For example, in the latent 
response formulation, we can specify models for dependence between 
dichotomous variables by simply allowing the latent responses to be correlated 
(‘tetrachoric correlations’). Similarly, we can model the dependence between 
dichotomous and continuous variables by allowing latent responses to be correlated 
with observed continuous responses (‘biserial correlations’), as in selection 
models (e.g. Heckman, 1979). We refer to Section 4.8.2 for details. 

Several estimation methods are furthermore based on the latent response 
formulation (see Chapter 6). This includes the limited information methods 
suggested by Muthén (e.g. Muthén, 1984; Muthén and Satorra, 1995) and 
discussed in Section 6.7 and the EM algorithms discussed by Schoenberg 
(1985). Some estimation methods simulate latent responses. Examples include 
the Monte Carlo EM algorithms discussed in Section 6.4.1 (e.g. Meng and 
Schilling, 1996), Markov Chain Monte Carlo methods of the kind exemplified 
in Section 6.11.5 (e.g. Albert and Chib, 1993), the method of simulated 
moments (e.g. McFadden, 1989), and the GHK method for simulated maximum 
likelihood (e.g. Train, 2003) discussed in Section 6.3.3. 

Moreover, it is much easier to investigate model identification using the 
latent response formulation, as we will see in Chapter 5. Finally, procedures 
for model diagnostics based on latent responses have been developed by, for 
instance, Chesher and Irish (1987) and Gouriéroux et al. (1987a) in the 
frequentist setting and by Albert and Chib (1995) in Bayesian modeling (see 
Section 8.6). 

1.7 Generating flexible distributions 

Latent variables are useful for generating distributions with the desired 
variance function and shape, or multivariate distributions with a particular 
dependence structure. 

As discussed in Section 1.4, overdispersion of counts can be addressed by 
including a random intercept in the model (e.g. Breslow and Clayton, 1993). 
If there is an excess number of zero counts, zero-inflated Poisson (ZIP) 
models can be used which are a mixture of a Poisson model and a mass at zero 
(e.g. Lambert, 1992). This kind of model may for instance be useful if the 
response is the number of alcoholic drinks consumed in a week. In this setting a 
zero response could be due to the person being a ‘teetotaller’ (nondrinker) or 
simply a random fluctuation for a drinker. In the ZIP model, the component 
‘membership’ label (or latent class) can be viewed as a realization of a 
discrete latent variable. An application of this model for the number of decayed, 
missing or filled teeth is given in Section 11.2. Many other types of mixture 
models have been used to generate flexible distributions, including mixtures 
of normals and mixtures of Poisson distributions, see for example Everitt and 
Hand (1981), Böhning (2000) and McLachlan and Peel (2000). 
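
The zero-inflated Poisson distribution can be written down directly as a two-component mixture. In the sketch below, pi_zero is a hypothetical name for the probability of belonging to the ‘always zero’ class (for example teetotallers) and mu is the Poisson mean for the other class; the numerical values are illustrative.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y for mean mu > 0."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zip_pmf(y, mu, pi_zero):
    """Mixture of a mass at zero (weight pi_zero) and a Poisson(mu)."""
    poisson_part = (1.0 - pi_zero) * poisson_pmf(y, mu)
    return pi_zero + poisson_part if y == 0 else poisson_part

# Hypothetical values: 30% nondrinkers, drinkers average 4 drinks a week.
prob_zero = zip_pmf(0, 4.0, 0.3)
zip_mean = sum(y * zip_pmf(y, 4.0, 0.3) for y in range(100))
```

A zero count thus arises either from the zero class or as a Poisson zero, so prob_zero exceeds the plain Poisson probability of zero, while the mean is reduced to (1 - pi_zero) * mu.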

Another use of latent variables is as a parsimonious way of inducing 
dependence between responses. A typical example is the analysis of longitudinal data 
as discussed in Section 3.6. Latent variables are also used to induce 
dependence between nonnormal responses where flexible multivariate distributions 
do not exist. For example, Coull and Agresti (2000) use correlated random 
intercepts in the linear predictors of logistic regression models to model the 
dependence between dichotomous responses. They refer to these models as 
binomial logit-normal models (BLN). Latent variables are also often used to 
induce dependence between different processes. A common example is joint 
modeling of a response of interest and a missing data process, in so-called 
‘shared random effects’ models (e.g. Wu and Carroll, 1988). This idea is used 
for endogenous treatment modeling in Section 14.5 and joint modeling of 
repeated measurements and survival in Section 14.6. 


1.8 Combining information about individual units from different 
sources 

It is often of interest to assign values to latent variables, taking the form of 
scores for continuous latent variables and categories and classes in the 
categorical case. Such prediction, scoring or classification is important for all 
types of latent variables discussed. For instance, true classification of 
categorical variables measured with error (or misclassified) is crucial in medical 
diagnosis, and is performed in Section 9.3 for myocardial infarction. Scoring of 
hypothetical constructs such as ability is central in education as illustrated in 
Section 9.4, whereas prediction of unit-specific effects is the purpose of small 
area estimation and disease mapping as discussed in Section 11.4. 

It is beneficial to base scoring and classification on explicit latent 
variable models. This approach allows empirical assessment of the reliability and 
validity of the measurements and provides optimal means of combining 
information. As explained in Chapter 7, predictions for a unit are not solely based 
on the measurement for that unit, but are also influenced by the estimated 
distribution of the latent variables for the population of units. In the empirical 
Bayesian approach, the latent variable distribution represents the (empirical) 
prior, whereas the conditional distribution of the measurements given the 
latent variables represents the ‘likelihood’. Since the latent variable 
distribution is estimated using information from all units, prediction for a given unit 
‘borrows strength’ from the measurements of other units (e.g. Rubin, 1983; 
Morris, 1983). For instance, in disease mapping adjacent geographical units 
provide useful information improving estimates of disease rates that are based 
on small numbers of events. 
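
A minimal sketch of how such ‘borrowing of strength’ works is given below. It is not the general method of Chapter 7 but a simple special case chosen for illustration: one measurement per unit, a normal latent variable, normal measurement error with a known variance (error_variance, assumed positive), and moment estimates of the prior mean and variance. All names and data are hypothetical.

```python
def empirical_bayes_scores(measurements, error_variance):
    """Shrink each measurement towards the estimated population mean.

    The prior mean and variance are estimated from all units, so the
    prediction for one unit depends on the measurements of the others.
    """
    n = len(measurements)
    mean = sum(measurements) / n
    total_variance = sum((y - mean) ** 2 for y in measurements) / (n - 1)
    # Moment estimate of the latent variable variance, truncated at zero.
    latent_variance = max(total_variance - error_variance, 0.0)
    reliability = latent_variance / (latent_variance + error_variance)
    return [mean + reliability * (y - mean) for y in measurements]

# Hypothetical measurements for five units, error variance 2.
scores = empirical_bayes_scores([2.0, 4.0, 6.0, 8.0, 10.0], 2.0)
```

Here the estimated reliability is 0.8, so each score is pulled one fifth of the way from the unit's own measurement towards the overall mean of 6.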


1.9 Summary 

We have demonstrated that latent variables pervade modern statistics and 
described how they are used to represent widely different phenomena such 
as true variables measured with error, hypothetical constructs, unobserved 
heterogeneity, missing data, counterfactuals and latent responses underlying 
categorical variables. Latent variables can also be used to generate flexible 
multivariate distributions and to combine information about individual units 
from different sources. 

In the next chapter we describe the class of generalized linear models which, 
although not including latent variables, provide an essential stepping stone to 
the latent variable models that are the core of the rest of the book. 


CHAPTER 2 


Modeling different response processes 


2.1 Introduction 

In this chapter we describe a wide range of response processes, producing the 
following types of observed responses: 

• Continuous or metric 

• Dichotomous 

• Grouped 

• Censored 

• Ordinal 

• Unordered polytomous or nominal 

• Pairwise comparisons 

• Rankings or permutations 

• Counts 

• Durations or survival 

The aim of statistical modeling is to capture the main features of the 
empirical process under investigation (see Section 8.2 for further discussion). 
Typically, a first simplifying step is to focus on a restricted set of response variables 
and to consider the data generating process of these variables given a set of 
explanatory variables. Univariate models have one response variable whereas 
multivariate models have several, possibly including intervening or 
intermediate variables serving as both response and explanatory variables. The 
response variables are sometimes called ‘dependent’, ‘endogenous’ or ‘outcome’ 
variables whereas the explanatory variables are called ‘independent’, 
‘exogenous’ or ‘predictor’ variables. The explanatory variables of primary interest 
are sometimes called ‘exposures’ or ‘(risk) factors’ and the others ‘confounders’ 
or ‘covariates’. However, the term ‘covariate’ is often used as a generic term 
for explanatory variable. 

The variables can be further classified according to their ‘measurement 
levels’. For explanatory variables it is usually sufficient to distinguish between 
continuous or categorical variables since we do not model these variables but 
merely condition on them. If the values of a variable are ordered and 
differences between values are meaningful, the variable is typically treated as 
continuous, otherwise as categorical. 

For response variables, we must also consider the process that may have 
generated the response since this is crucial for formulating an appropriate 
statistical model. For example, a variable taking on ordered discrete values 
1,2,3, etc. could represent an ordinal response such as the level of pain (none, 
mild, moderate, etc.), a count such as the number of headache-free days in a 
week, or a discrete-time duration such as the number of months from diagnosis 
of some condition to death. The response processes generating these same 
values are obviously of a widely different nature and require different statistical 
models. 

There are two general approaches to modeling response processes. In 
statistics and biostatistics, the most common approach is generalized linear 
modeling, whereas a latent response formulation is popular in econometrics and 
psychometrics. Although very different in appearance, the approaches can 
generate equivalent models for many response types. However, as we will see in 
later chapters, the choice of formulation can have implications for estimation 
and identification. The latent response formulation is useful even for 
applications where interpretation in terms of a latent response appears contrived. 

We start by describing generalized linear models and their extensions. We 
then introduce the latent response formulation and point out correspondences 
between approaches. Finally, durations or survival data are discussed 
separately because they do not fit entirely into either of the frameworks. Both 
continuous and discrete time models are considered. 

In this chapter we do not yet introduce random coefficients or common 
factors. However, the models discussed represent an essential building block 
for the general model framework to be presented in Chapter 4. 

2.2 Generalized linear models 

2.2.1 Introduction 

In generalized linear models (e.g. Nelder and Wedderburn, 1972) the 
explanatory variables affect the response only through the linear predictor $\nu_i$ for 
unit $i$, 

$$\nu_i = \mathbf{x}_i'\boldsymbol{\beta},$$

where $\mathbf{x}_i$ is a vector of explanatory variables and $\boldsymbol{\beta}$ contains the corresponding 
regression parameters. 

Both continuous and categorical explanatory variables can be 
accommodated. For continuous variables such as age, a single term is often used to 
represent the linear effect of age on the linear predictor. More flexible ways of 
modeling effects include polynomials, splines or other smooth functions. For 
categorical variables such as nationality (e.g. Norwegian, German, other), a 
dummy variable would typically be specified for each category except for a 
reference category, for instance ‘other’. For example, a dummy variable for 
Norwegian equals 1 if the person is Norwegian and 0 otherwise so that the 
corresponding coefficient represents the effect of being Norwegian compared 
with ‘other’. 
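
The dummy coding described above can be sketched as follows. The coefficient values and helper names are hypothetical and serve only to show how a row of explanatory variables and the resulting linear predictor are formed.

```python
def dummy_code(category, categories, reference):
    """One 0/1 dummy per category, omitting the reference category."""
    return [1.0 if category == c else 0.0
            for c in categories if c != reference]

def linear_predictor(x, beta):
    """Inner product of explanatory variables and regression parameters."""
    return sum(xk * bk for xk, bk in zip(x, beta))

CATEGORIES = ["Norwegian", "German", "other"]

def design_row(age, nationality):
    """Constant, linear age term, then dummies with 'other' as reference."""
    return [1.0, age] + dummy_code(nationality, CATEGORIES, "other")

# Hypothetical parameters: constant, age, Norwegian, German.
beta = [1.0, 0.02, 0.5, -0.3]
nu_norwegian = linear_predictor(design_row(40.0, "Norwegian"), beta)
nu_other = linear_predictor(design_row(40.0, "other"), beta)
```

At a given age, the difference nu_norwegian - nu_other equals the Norwegian coefficient, which is the sense in which that coefficient compares Norwegians with the reference category.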

The response process is fully described by specifying the conditional 
probability (density) of $y_i$ given the linear predictor. The simplest response process 
is the continuous. A linear regression model 

$$y_i = \nu_i + \epsilon_i \qquad (2.1)$$

is usually specified in this case, where the residuals $\epsilon_i$ are independently 
normally distributed with zero mean and constant variance $\sigma^2$, 

$$\epsilon_i \sim N(0, \sigma^2). \qquad (2.2)$$

Special cases of linear regression models include analysis of variance (ANOVA) 
and analysis of covariance (ANCOVA) models. 

The linear regression model can alternatively be defined by setting the 
conditional expectation of the response, given the linear predictor, equal to 

$$\mu_i \equiv E(y_i|\nu_i) = \nu_i,$$

and specifying that the $y_i$ are independently normally distributed with mean 
$\mu_i$ and variance $\sigma^2$. 

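For dichotomous or binary responses taking on values 0 or 1, the conditional 
probability of response 1, $\Pr(y_i = 1|\nu_i)$, is just the conditional expectation $\mu_i$ 
of $y_i$. This can be modeled as a logistic regression 

$$\mu_i = \frac{\exp(\nu_i)}{1 + \exp(\nu_i)} \quad \text{or} \quad \ln\left(\frac{\mu_i}{1 - \mu_i}\right) = \nu_i,$$

or a probit regression 

$$\mu_i = \Phi(\nu_i) \quad \text{or} \quad \Phi^{-1}(\mu_i) = \nu_i,$$

where $\Phi(\cdot)$ is the standard normal cumulative distribution function. 
Conditional on $\nu_i$, the $y_i$ are independently Bernoulli distributed. 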

Counts are discrete non-negative integer valued responses (0, 1, ...). The 
standard model for counts is Poisson regression with expectation 

$$\mu_i = \exp(\nu_i) \quad \text{or} \quad \ln(\mu_i) = \nu_i$$

and Poisson distribution 

$$\Pr(y_i|\mu_i) = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{y_i!}. \qquad (2.3)$$

Counts have a Poisson distribution if the events being counted for a unit occur 
at a constant rate in continuous time and are mutually independent. 

If a count corresponds to the number of events in a given number $n$ of ‘trials’ 
(or opportunities for an event), the count has a binomial distribution if the 
events for a unit are independent and equally probable. The probability of a 
proportion $y_i$ out of $n$ then is 

$$\Pr(y_i|\mu_i) = \binom{n}{y_i n} \mu_i^{y_i n} (1 - \mu_i)^{n - y_i n},$$

where the binomial coefficient $\binom{n}{y_i n} = n!/[(y_i n)!\,(n - y_i n)!]$ is the number of 
ways of choosing $y_i n$ out of $n$ objects regardless of their ordering. 

2.2.2 Model structure 

All the models described above have a common structure and represent special 
cases of generalized linear models defined by two components: 

1. The functional relationship between the expectation of the response and 
the linear predictor is 

$$\mu_i = g^{-1}(\nu_i) \quad \text{or} \quad g(\mu_i) = \nu_i,$$

where $g(\cdot)$ is a link function. We have already encountered the identity, 
logit, probit and log link for linear, logistic, probit and Poisson regression, 
respectively. These and other common links are given in Table 2.1. 



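Table 2.1 Common links 

Link                    $g(\mu)$                 $g^{-1}(\nu)$                    range of $g^{-1}(\nu)$
Identity                $\mu$                    $\nu$                            $-\infty, \infty$
Reciprocal              $1/\mu$                  $1/\nu$                          $-\infty, \infty$
Logarithm               $\ln(\mu)$               $\exp(\nu)$                      $0, \infty$
Logit                   $\ln(\mu/(1-\mu))$       $\exp(\nu)/(1+\exp(\nu))$        $0, 1$
Probit                  $\Phi^{-1}(\mu)$         $\Phi(\nu)$                      $0, 1$
Scaled probit           $\sigma\Phi^{-1}(\mu)$   $\Phi(\nu/\sigma)$               $0, 1$
Complementary log-log   $\ln(-\ln(1-\mu))$       $1 - \exp(-\exp(\nu))$           $0, 1$
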

2. The conditional probability distribution of the responses is a member of 
the exponential family with expectation $\mu_i$ and, possibly, a common scale 
parameter $\phi$, 

$$f(y_i|\theta_i, \phi) = \exp\left\{ \frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}. \qquad (2.4)$$

Here, $\theta_i$ is the canonical or natural parameter, $\phi$ is the scale or dispersion 
parameter and $b(\cdot)$ and $c(\cdot)$ are functions depending on the member of the 
exponential family. We have already encountered the normal or Gaussian, 
the Bernoulli, Poisson and binomial distributions. Table 2.2 gives details 
on these and other important members of the exponential family. 
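
Table 2.2 Members of the exponential family 

Distribution       Canonical link       Cumulant function        Dispersion      Expectation                        Variance
                   $\theta(\mu)$        $b(\theta)$              parameter       $b'(\theta)$                       $\phi b''(\theta)$
                                                                 $\phi$
Bernoulli          $\ln(\mu/(1-\mu))$   $\ln(1+\exp(\theta))$    $1$             $\exp(\theta)/(1+\exp(\theta))$    $\mu(1-\mu)$
Binomial†          $\ln(\mu/(1-\mu))$   $\ln(1+\exp(\theta))$    $1/n$           $\exp(\theta)/(1+\exp(\theta))$    $\mu(1-\mu)/n$
Poisson            $\ln(\mu)$           $\exp(\theta)$           $1$             $\exp(\theta)$                     $\mu$
Normal             $\mu$                $\theta^2/2$             $\sigma^2$      $\theta$                           $\sigma^2$
Gamma              $-1/\mu$             $-\ln(-\theta)$          $\alpha^{-1}$   $-1/\theta$                        $\mu^2\alpha^{-1}$
Inverse Gaussian   $1/\mu^2$            $-(-2\theta)^{1/2}$      $\sigma^2$      $(-2\theta)^{-1/2}$                $\mu^3\sigma^2$

Distribution       Probability or density
Bernoulli          $\mu^y (1-\mu)^{1-y}$
Binomial†          $\binom{n}{yn} \mu^{yn} (1-\mu)^{n-yn}$
Poisson            $\exp(-\mu)\,\mu^y / y!$
Normal             $(2\pi\sigma^2)^{-1/2} \exp\{-(y-\mu)^2/(2\sigma^2)\}$
Gamma              $\frac{1}{\Gamma(\alpha)} (\alpha/\mu)^{\alpha} y^{\alpha-1} \exp(-\alpha y/\mu)$
Inverse Gaussian   $(2\pi y^3 \sigma^2)^{-1/2} \exp\{-(y-\mu)^2/(2\mu^2\sigma^2 y)\}$

† $y$ is the proportion of ‘successes’ out of $n$ ‘trials’ 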



Assuming that the responses of units are independent, the likelihood for 
generalized linear models is 

$$L = \prod_{i=1}^{N} L_i,$$

where $L_i = f(y_i|\theta_i, \phi)$ is the likelihood contribution from unit $i = 1, 2, \ldots, N$, 
and the log-likelihood becomes 

$$\ell = \sum_{i=1}^{N} \ell_i,$$

where $\ell_i \equiv \ln L_i$. The first and second derivatives of the log-likelihood 
contributions with respect to $\theta_i$ are 

$$\frac{\partial \ell_i}{\partial \theta_i} = [y_i - b'(\theta_i)]/\phi, \qquad (2.5)$$

and 

$$\frac{\partial^2 \ell_i}{\partial \theta_i^2} = -b''(\theta_i)/\phi, \qquad (2.6)$$

where $b'(\theta_i)$ and $b''(\theta_i)$ are the first and second derivatives of $b(\cdot)$ evaluated 
at $\theta_i$. Maximum likelihood estimation of generalized linear models using 
iteratively reweighted least squares is described in Section 6.8.1. 


2.2.3 Mean function and choice of link function 

From standard likelihood theory the expected scores are zero, 

$$E\left(\frac{\partial \ell_i}{\partial \theta_i}\right) = 0,$$

so it follows from (2.5) that 

$$\mu_i = b'(\theta_i).$$

Writing $\theta_i$ as a function of $\mu_i$ gives the canonical link function 

$$g(\mu_i) = \theta_i = (b')^{-1}(\mu_i).$$

The canonical link has convenient statistical properties. However, the choice 
of link should be guided by theoretical considerations and model fit. One 
consideration in choosing a link function is the range of values it generates 
for the mean $\mu_i = g^{-1}(\nu_i)$ when $-\infty < \nu_i < \infty$ (see Table 2.1). For example, 
the logit and probit links are popular for dichotomous responses because they 
restrict the probability to lie in the permissible interval [0,1]. In contrast, 
use of the identity link may in this case lead to predicted probabilities that 
are negative or larger than one. 
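
The effect of the link on the range of the mean can be seen by evaluating a few inverse links over a grid of linear predictor values. The dictionary below is an illustrative sketch, not a library interface; its keys and the grid are arbitrary choices.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Inverse link functions mapping the linear predictor to the mean.
INVERSE_LINKS = {
    "identity": lambda nu: nu,
    "log": lambda nu: math.exp(nu),
    "logit": lambda nu: 1.0 / (1.0 + math.exp(-nu)),
    "probit": norm_cdf,
    "cloglog": lambda nu: 1.0 - math.exp(-math.exp(nu)),
}

grid = [-3.0, -1.0, 0.0, 1.0, 3.0]
means = {name: [inverse(nu) for nu in grid]
         for name, inverse in INVERSE_LINKS.items()}
```

On this grid the logit, probit and complementary log-log means stay strictly between 0 and 1, the log link gives positive means, and the identity link returns values such as -3 and 3 that are not valid probabilities.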

Another important consideration relates to the interpretation of the 
regression parameters. Since $\nu_i = \mathbf{x}_i'\boldsymbol{\beta}$, using an identity link corresponds to additive 
effects of the covariates on the mean, $\mu_i = \mathbf{x}_i'\boldsymbol{\beta}$, and a log link corresponds 
to multiplicative effects, $\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. Using a logit link for dichotomous 
responses gives a multiplicative model for the odds, $\mu_i/(1-\mu_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. This 
link is particularly useful in case-control studies since odds-ratios are invariant 
with respect to retrospective or ‘choice-based’ sampling (e.g. Farewell, 1979). 

Use of the identity link for dichotomous responses has been advocated in 
epidemiology, motivated by a particular notion of causality (see Skrondal (2003) 
and the references therein). This illustrates that there are sometimes reasons 
for departing from the canonical links of the exponential family. 

2.2.4 Variance function and choice of distribution 

The choice of distribution depends on the type of response variable, the process 
that may have generated the response variable, and the shape of the empirical 
distribution. For binary responses, the obvious choice is the Bernoulli 
distribution. Counts can be shown to have a Poisson distribution if the process 
generating the events has certain characteristics (constant incidence rate and 
independence). 

The choice of distribution also determines the conditional variance of the 
responses as a function of the mean. It follows from standard likelihood theory 
that 

$$E\left(\frac{\partial^2 \ell_i}{\partial \theta_i^2}\right) + E\left[\left(\frac{\partial \ell_i}{\partial \theta_i}\right)^2\right] = 0,$$

and substitution of terms from (2.5) and (2.6) gives 

$$\mathrm{Var}(y_i|\mu_i) = \phi\, b''(\theta_i) = \phi V(\mu_i),$$

where $V(\mu_i)$ is known as the variance function and $\phi$ is the dispersion 
parameter. For example, the variance equals the mean for the Poisson distribution 
whereas the variance is given by a constant parameter $\phi = \sigma^2$ for the normal 
distribution (see Table 2.2). 
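
The relations between the cumulant function, the mean and the variance can be checked numerically. The sketch below differentiates the Poisson and Bernoulli cumulant functions by central differences; the step size and the helper names are arbitrary choices made for illustration.

```python
import math

def derivative(f, x, h=1e-4):
    """Central-difference approximation to the first derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-4):
    """Central-difference approximation to the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def mean_and_variance(b, theta, phi=1.0):
    """Mean b'(theta) and variance phi * b''(theta) for cumulant function b."""
    return derivative(b, theta), phi * second_derivative(b, theta)

def poisson_b(theta):
    return math.exp(theta)

def bernoulli_b(theta):
    return math.log(1.0 + math.exp(theta))

# Poisson with mean 3: theta = ln(3); Bernoulli with mean 0.5: theta = 0.
poisson_mean, poisson_var = mean_and_variance(poisson_b, math.log(3.0))
bernoulli_mean, bernoulli_var = mean_and_variance(bernoulli_b, 0.0)
```

Up to numerical error, the Poisson variance equals its mean, and the Bernoulli variance equals the mean times one minus the mean.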

2.3 Extensions of generalized linear models 

2.3.1 Modeling underdispersion and overdispersion 

Counts are typically modeled by the binomial or Poisson distribution (the 
binomial if the event can occur only at a predetermined number of ‘trials’, $n$). 
For both distributions the conditional variance, given the explanatory 
variables, is determined by the mean. However, the conditional variance observed 
in practice is often larger or smaller than that implied by the model, 
phenomena known as overdispersion or underdispersion, respectively. Overdispersion 
could be due to variability in the binomial probabilities or Poisson rates not 
fully accounted for by the included covariates and is more common than 
underdispersion. 

An ad hoc solution to the problems of overdispersion and underdispersion 
is to introduce an extra proportionality parameter $\phi^*$ for the variance, giving 

$$\mathrm{Var}(y_i|\mu_i) = \phi^* \mu_i (1 - \mu_i)/n$$

for the binomial and 

$$\mathrm{Var}(y_i|\mu_i) = \phi^* \mu_i$$

for the Poisson distribution. Note that these specifications need not correspond 
to probability models but can nevertheless be estimated using so-called 
quasi-likelihood methods (see Section 6.8). 

An alternative ‘proper’ modeling approach to overdispersion is to allow the 
mean to vary randomly between units for fixed covariate values. Combining 
the binomial response distribution with a beta distribution for the 
probabilities gives the beta-binomial distribution. The negative binomial distribution 
results from combining the Poisson response distribution with a gamma 
distribution for the rate. Another possibility is to include a normally distributed 
random intercept in the linear predictor, a special case of the general model 
framework to be presented in Chapter 4. Lindsey (1999, p.197-220) discusses 
these and further methods for modeling overdispersion and underdispersion. 
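
The overdispersion produced by the gamma mixture can be verified from the negative binomial probabilities. The sketch assumes one common parameterization, not necessarily the one used elsewhere in the book: the Poisson rate is gamma distributed with mean mu and shape alpha, which gives variance mu + mu^2/alpha. Function names are illustrative.

```python
import math

def negative_binomial_pmf(y, mu, alpha):
    """Poisson-gamma mixture with mean mu and gamma shape alpha."""
    log_p = (math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
             + alpha * math.log(alpha / (alpha + mu))
             + y * math.log(mu / (alpha + mu)))
    return math.exp(log_p)

def mean_and_variance(pmf, upper=500):
    """Moments from the probabilities of the counts 0, ..., upper - 1."""
    mean = sum(y * pmf(y) for y in range(upper))
    variance = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, variance

# Hypothetical values: mean 3, shape 2.
nb_mean, nb_variance = mean_and_variance(
    lambda y: negative_binomial_pmf(y, 3.0, 2.0))
```

With these values the mean stays at 3 while the variance is 3 + 9/2 = 7.5, larger than the Poisson variance of 3.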

2.3.2 Modeling heteroscedasticity 

A classical assumption in linear regression models is homoscedasticity, i.e. the 
residual standard deviation $\sigma$ is assumed to be constant over units. However, 
$\sigma$ may depend on categorical or continuous covariates. For example, when 
comparing the heights of boys and girls aged 11, we would expect the girls’ 
heights to be more variable because many of the girls would have entered 
puberty while (nearly) all the boys would be prepubertal. Since the standard 
deviation must be positive, it is convenient to model heteroscedasticity using 
a log link, 

$$\ln \sigma_i = \mathbf{x}_i'\boldsymbol{\iota} \quad \text{or} \quad \sigma_i = \exp(\mathbf{x}_i'\boldsymbol{\iota}), \qquad (2.7)$$

where $\mathbf{x}_i$ are covariates and $\boldsymbol{\iota}$ parameters. Such ‘multiplicative 
heteroscedasticity’ was suggested by Harvey (1976). This specification can also be used for 
other models with scale or dispersion parameters; see also page 31. 


2.3.3 Models for polytomous responses 

Polytomous responses are unordered categorical responses such as political 
party voted for in an election. In econometrics such responses are often 
referred to as discrete choices. Terms used in statistics include qualitative or 
nominal responses since the categories are not quantitative and have no 
inherent ordering. Other terms include quantal, polychotomous or multinomial 
responses. 

A separate linear predictor is specified for each category $a_s$, $s = 1, \ldots, S$. 
In this respect, the response can be viewed as multivariate and is sometimes 
represented as a vector having a one for the realized category and zeros for 
the other categories (e.g. Fahrmeir and Tutz, 2001). The probability of the 
$s$th category or alternative $a_s$ is typically modeled as a multinomial logit 

$$\Pr(y_i = a_s) = \frac{\exp(\nu_i^s)}{\sum_{t=1}^{S} \exp(\nu_i^t)}, \qquad (2.8)$$

where $\nu_i^s$ is the linear predictor for unit $i$ and category $a_s$ and the sum is over 
all $S$ categories. Thissen and Steinberg (1986) refer to models of this type as 
divide-by-total models. Note that $\Pr(y_i = a_s)$ is a conditional probability given 
the linear predictors, although we have suppressed the conditioning here and 
in the remainder of the chapter for simplicity. 

We can include unit and category-specific covariates or attributes in the 
linear predictor. For instance, consider the case where the response categories 
are supermarkets and a customer’s response represents his choice of 
supermarket $a_s$. The linear predictors could include customer specific variables $\mathbf{x}_i$ 
such as income as well as customer and supermarket specific variables $\mathbf{x}_i^s$ such 
as travelling time to the supermarket. The linear predictor becomes 

$$\nu_i^s = m^s + \mathbf{x}_i'\boldsymbol{\beta}^s + \mathbf{x}_i^{s\prime}\boldsymbol{\beta}, \qquad (2.9)$$

where $m^s$ is a category-specific constant, $\boldsymbol{\beta}^s$ are category-specific effects of 
unit-specific covariates $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are constant effects of unit and 
category-specific covariates $\mathbf{x}_i^s$. The coefficients of $\mathbf{x}_i^s$ could also differ between categories 
if for example the effect of travelling time is greater for small supermarkets 
than for large ones. 

Note that adding a term $\delta_i$ to the linear predictors $\nu_i^s$, $s = 1, \ldots, S$, 
for unit $i$ does not change the probability in (2.8) since it amounts to multiplying 
both numerator and denominator by $\exp(\delta_i)$. For this reason, we could 
add constants to $m^s$ and $\boldsymbol{\beta}^s$ without changing the model. This is an example 
of an identification problem (see also Chapter 5). The problem can be overcome 
by taking one category as ‘base category’ (typically the first, $a_1$) and 
imposing $m^1 = 0$ and $\boldsymbol{\beta}^1 = \mathbf{0}$. 

The conditional logit model, standard in econometrics, arises as the special 
case where there are no unit-specific covariates and no constants $m^s$ (e.g. 
McFadden, 1973). The polytomous logistic regression model, a standard model 
in for instance biostatistics (e.g. Hosmer and Lemeshow, 2000), results as the 
special case where there are no category-specific covariates $\mathbf{x}_i^s$. 
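
The divide-by-total form of (2.8) and the identification problem described above can be illustrated numerically. The following is a minimal sketch, not part of the original text: the function name and the values of the linear predictors are hypothetical and chosen only for illustration.

```python
import math

def multinomial_logit_probs(nu):
    """Return Pr(y = a_s), s = 1..S, for linear predictors nu[0..S-1], as in (2.8)."""
    m = max(nu)  # subtracting a common constant is harmless, see the identification remark
    w = [math.exp(v - m) for v in nu]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical linear predictors for S = 3 categories of a single unit
nu = [0.0, 0.5, -1.2]
p = multinomial_logit_probs(nu)

# Adding the same term to every linear predictor leaves the probabilities unchanged
p_shifted = multinomial_logit_probs([v + 3.7 for v in nu])
```

Under these assumptions `p` and `p_shifted` agree up to rounding error, which is why one category has to be taken as base category.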


2.3.4 Models for ordinal responses 


Cumulative models 

Models for ordered categorical or ordinal responses can be defined by linking 
the cumulative probability $\Pr(y_i \le a_s)$ to the linear predictor (e.g. McCullagh, 
1980), 

$$g[\Pr(y_i \le a_s)] = \kappa_s - \nu_i, \quad s = 1, \ldots, S-1, \qquad (2.10)$$

where $a_1 < a_2 < \cdots < a_S$ are ordered response categories, $\Pr(y_i \le a_S) = 1$ and $\kappa_s$ 
are threshold parameters, $\kappa_1 < \kappa_2 < \cdots < \kappa_{S-1}$. Typical choices of link function 
include the probit, logit and complementary log-log. Such models are often 
called cumulative models or graded response models (Samejima, 1969). 

Consider the right-hand side of (2.10), $\kappa_s - \nu_i$. Since adding an arbitrary 
constant $\beta_0$ to the linear predictor can be counteracted by adding the same 
constant to each $\kappa_s$, it is clear that we cannot simultaneously estimate the 
constant and all thresholds. This identification problem can be overcome by 
setting $\kappa_1 = 0$, making the parametrization identical to that used for dichotomous 
response models in the previous section (if $S = 2$, $a_1 = 0$ and $a_2 = 1$). 
Alternatively, $\kappa_1$ could be a model parameter if we instead omit the constant 
from the linear predictor. 

It follows from (2.10) that the probability of a particular response $a_s$ becomes 

$$\Pr(y_i = a_s) = \Pr(y_i \le a_s) - \Pr(y_i \le a_{s-1}), \qquad (2.11)$$

prompting Thissen and Steinberg (1986) to refer to these models as difference 
models. 

The effects of the covariates on the cumulative response probabilities in (2.10) 
are constant across categories $s$, a feature called the parallel regression 
assumption. With a logit link, the odds of $y_i$ exceeding $a_s$ become 

$$\frac{\Pr(y_i > a_s)}{\Pr(y_i \le a_s)} = \frac{1 - \Pr(y_i \le a_s)}{\Pr(y_i \le a_s)} = \exp(\mathbf{x}_i'\boldsymbol{\beta} - \kappa_s).$$

The ratio of these odds for two units $i$ and $i'$, $\exp[(\mathbf{x}_i - \mathbf{x}_{i'})'\boldsymbol{\beta}]$, is the same 
for all $s$, a property known as proportional odds. A useful feature of cumulative 
models is that the estimated regression parameters are approximately 
invariant to merging of the categories. 
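
The difference form (2.11) and the proportional odds property can be checked with a small calculation. This is a sketch under assumed values: the thresholds and the two linear predictors are hypothetical, and the helper names are not from the original text.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_logit_probs(nu, kappa):
    """Category probabilities Pr(y = a_s), s = 1..S, from (2.10)-(2.11) with a logit link.
    kappa holds the S-1 increasing thresholds; Pr(y <= a_S) = 1."""
    cum = [logistic_cdf(k - nu) for k in kappa] + [1.0]
    return [cum[0]] + [cum[s] - cum[s - 1] for s in range(1, len(cum))]

def odds_exceeding(nu, kappa_s):
    """Odds of exceeding category a_s, Pr(y > a_s) / Pr(y <= a_s)."""
    p_le = logistic_cdf(kappa_s - nu)
    return (1.0 - p_le) / p_le

kappa = [-1.0, 0.3, 1.5]   # hypothetical thresholds, S = 4 categories
nu_i, nu_j = 0.8, -0.4     # hypothetical linear predictors of two units

probs = cumulative_logit_probs(nu_i, kappa)
# The odds ratio between the two units is the same at every threshold
odds_ratios = [odds_exceeding(nu_i, k) / odds_exceeding(nu_j, k) for k in kappa]
```

Every entry of `odds_ratios` equals `exp(nu_i - nu_j)`, regardless of the threshold used.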

The assumption of constant effects of the covariates across response categories 
can be relaxed by allowing the thresholds to depend on covariates $\mathbf{x}_{2i}$ 
(e.g. Terza, 1985), 

$$\kappa_{is} = \mathbf{x}_{2i}'\boldsymbol{\varsigma}_s,$$

where $\boldsymbol{\varsigma}_s$ is a parameter vector. Considering the probit version, the model then 
becomes 

$$P(y_i \le a_s) = \Phi(\kappa_{is} - \nu_i) = \Phi(\mathbf{x}_{2i}'\boldsymbol{\varsigma}_s - \mathbf{x}_i'\boldsymbol{\beta}), \qquad (2.12)$$

where $\mathbf{x}_{2i}$ are covariates with category-specific effects and $\mathbf{x}_i$ are covariates 
with constant effects. It is clear that the coefficients of any variables included in 
both $\mathbf{x}_{2i}$ and $\mathbf{x}_i$ are not separately identified. A problem with this parametrization 
is that for some covariate values the thresholds will not necessarily satisfy 
the order constraint $\kappa_{i1} < \kappa_{i2} < \cdots < \kappa_{i,S-1}$, so the probabilities in (2.11) are 
not constrained to be nonnegative. The order constraint can be imposed by 
using the parametrization 

$$\kappa_{i1} = 0, \quad \kappa_{is} = \kappa_{i,s-1} + \exp(\mathbf{x}_{2i}'\boldsymbol{\varsigma}_s), \quad s = 2, \ldots, S-1,$$

see for example Fahrmeir and Tutz (2001). 

An alternative device for relaxing the parallel regression assumption is to 
use a scaled ordinal probit link in which the scale parameter is modeled as ... 

... has the same form as that of the standard multinomial logit model with 
category specific covariates $\mathbf{x}_i^s = s\,\mathbf{x}_i$. Goodman (1983) referred to this model 
as the parallel odds model, whereas the model is known as the partial credit 
model (Masters, 1982) in item response theory (see Section 3.3.4). 

Instead of assigning equally spaced scores to $a_s$, we may specify scores 
reflecting the ‘distances’ between the ordered categories. The stereotype model 
can be thought of as a generalization of this model where the scores are 
estimated instead of fixed. Models in which scores are assigned to both the 
$a_s$ and to an ordered explanatory variable correspond to the linear by linear 
association model (e.g. Goodman, 1979; Agresti, 2002). 


Continuation ratio logit model 


Another possibility is to assume proportionality of the odds of exceeding 
category $a_s$ given that $y_i$ is at least equal to $a_s$ ($y_i \ge a_s$), 

$$\frac{\Pr(y_i > a_s)}{\Pr(y_i = a_s)} = \exp(m^s + \mathbf{x}_i'\boldsymbol{\beta}),$$

interpretable as the odds of continuing beyond ‘stage’ $a_s$ versus stopping at 
that stage. An equivalent model is not obtained from reversing the ordering of 
the categories, suggesting that the model should only be used for sequential 
stages. Since this continuation ratio logit model is often used for discrete-time 
durations (sequential stages), we return to it in Section 2.5.2. 


2.3.5 Composite links 

Composite links are of the form 

$$\mu_i = \sum_j c_{ij}\, g_j^{-1}(\nu_j), \qquad (2.14)$$

where $c_{ij}$ are known constants. Such link functions are useful for modeling 
count data where some observed counts represent the sums of counts for different 
covariate values, typically due to missing covariate information. 

A famous example is the blood-type problem where the offspring inherits 
blood-type (phenotype) A if the mother and father contribute genes A and O 
in any of the combinations AA, AO or OA. If the gene frequencies for A and 
O are $p$ and $r$, respectively, the expected frequency of blood-type A is 

$$N(p^2 + 2pr) = \exp(\ln N + 2\ln p) + \exp(\ln N + \ln p + \ln r) + \exp(\ln N + \ln p + \ln r),$$

where the three terms on the right-hand side correspond to the expected 
frequencies for genotypes AA, AO and OA. In this example, $i$ in (2.14) indexes 
the observed phenotypes whereas $j$ indexes the unobserved genotypes. The 
coefficients $c_{ij}$ are 1 for all genotypes $j$ consistent with phenotype $i$ and 0 
otherwise, $g^{-1}$ is the exponential function, the log of the total number of 
units $\ln N$ is used as an offset (a covariate with coefficient set to 1) and $\ln p$ 
and $\ln r$ are model parameters. 

Another example of a composite link is the probability of a response category 
in cumulative models for ordinal responses, which is equal to a difference 
between cumulative probabilities (see equation (2.11)). In Section 9.4 we will 
use a composite link to specify a three-parameter item response model. We 
refer to Thompson and Baker (1981) and Rindskopf (1992) for further examples 
of the use of composite links. 
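
The blood-type example can be written out as a composite link in a few lines. The sketch below is illustrative only: the sample size and gene frequencies are hypothetical, and the dictionary layout is just one convenient way of holding the constants for phenotype A.

```python
import math

N = 500            # hypothetical total number of units
p, r = 0.3, 0.6    # hypothetical gene frequencies for A and O

ln_N, ln_p, ln_r = math.log(N), math.log(p), math.log(r)

# Linear predictors for the unobserved genotypes (ln N enters as an offset)
nu = {"AA": ln_N + 2 * ln_p, "AO": ln_N + ln_p + ln_r, "OA": ln_N + ln_p + ln_r}

# Known constants for phenotype A: 1 for every genotype consistent with it
c_A = {"AA": 1, "AO": 1, "OA": 1}

# Composite link with exponential inverse link: sum of c_ij * exp(nu_j)
mu_A = sum(c_A[g] * math.exp(nu[g]) for g in nu)
```

With these values `mu_A` reproduces `N * (p**2 + 2*p*r)`, the expected frequency of blood-type A.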

2.4 Latent response formulation 

A response variable $y_i$ can often be viewed as a partial observation or coarsening 
of a continuous latent response $y_i^*$ (e.g. Pearson, 1901). Let the latent 
response be modeled as 

$$y_i^* = \nu_i + \epsilon_i,$$

where $\nu_i$ is a linear predictor and $\epsilon_i$ an error term or disturbance. For continuous 
responses the latent response simply equals the observed response, $y_i = y_i^*$. 
Other response types arise when the latent response is coarsened by applying 
different kinds of threshold functions described in the following subsections. 

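2.4.1 Grouped, interval-censored, ordinal and dichotomous responses 

The observed response $y_i$ takes on one of $S$ response categories $a_s$, $s = 
1, \ldots, S$, and the relationship between observed and latent response can be 
written as 

$$y_i = a_s \quad \text{if} \quad \kappa_{i,s-1} < y_i^* \le \kappa_{is}, \quad s = 1, \ldots, S, \qquad (2.15)$$

where $\kappa_{i0} = -\infty$ and $\kappa_{iS} = \infty$. For $S = 3$ this is illustrated in Figure 2.1 for 
normally distributed $\epsilon_i$. 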

For grouped responses the thresholds do not vary between units, $\kappa_{is} = \kappa_s$, 
and are known a priori. An example of grouped data is salaries grouped into 
prespecified income brackets with boundaries $\kappa_s$, a situation considered by 
Stewart (1983). 

For interval-censored responses the $\kappa_{is}$ vary between units and are known 
a priori. For example, time of onset of an illness may not be known exactly 
but only to lie within a censoring interval between two clinic visits, with the 
timing of visits varying between individuals. 

For ordinal responses the thresholds $\kappa_s$ are unknown parameters and usually 
do not vary between units. For example, severity of pain may be described 
as ‘none’, ‘moderate’ or ‘severe’. These outcomes may literally be considered 
as resulting from pain severity, an unobserved continuous latent response, 
exceeding certain thresholds. Sometimes we can relax the assumption of constant 
thresholds to model individual differences in pain tolerance. 

Dichotomous responses can often be viewed as ordinal responses with 2 ... 

It follows that 

$$\Pr(y_i = 1 \mid \nu_i) = \Pr(y_i^* > 0 \mid \nu_i) = \Pr(\nu_i + \epsilon_i > 0) = \Pr(\epsilon_i > -\nu_i) = \Pr(\epsilon_i \le \nu_i) = F(\nu_i),$$

where the penultimate equality hinges on the symmetry of the density of $\epsilon_i$. 
Here $F = g^{-1}$ is the cumulative distribution function, the standard normal for 
the probit, the logistic for the logit, and the Gumbel for the complementary 
log-log link. 
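
The latent response argument for a dichotomous response can be checked by simulation. The sketch below assumes logistic errors (the logit case) and a hypothetical value of the linear predictor. It draws latent responses, records how often they exceed the threshold 0, and compares this share with the logistic distribution function.

```python
import math
import random

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def draw_logistic(rng):
    """Draw a standard logistic error by inverting its distribution function."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

rng = random.Random(2024)    # fixed seed so the sketch is reproducible
nu = 0.7                     # hypothetical linear predictor
n = 100_000

# y = 1 whenever the latent response y* = nu + eps exceeds the threshold 0
share_ones = sum(nu + draw_logistic(rng) > 0 for _ in range(n)) / n
prob_logit = logistic_cdf(nu)    # F(nu) for the logit link
```

Up to simulation error, `share_ones` agrees with `prob_logit`.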


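2.4.2 Censored responses 

The threshold model for doubly censored responses can be written as 

$$y_i = \begin{cases} \kappa_{i1} & \text{if } y_i^* \le \kappa_{i1} \\ y_i^* & \text{if } \kappa_{i1} < y_i^* \le \kappa_{i2} \\ \kappa_{i2} & \text{if } \kappa_{i2} < y_i^*. \end{cases} \qquad (2.16)$$

For the special case of right-censored responses, $\kappa_{i1} = -\infty$, and for 
left-censored responses, $\kappa_{i2} = \infty$. 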

Different types of censored responses are prominent in duration or survival 
analysis. Right-censoring is typically due to the event not having occurred 
by the end of the observation period. Left-censoring occurs if all we know 
is that the event had already happened before observation began. If both 
types of censoring can occur, the responses are doubly censored. Other examples 
of censoring are ceiling and floor effects. For example, when measuring 
ability using the percentage of correctly solved problems we cannot differentiate 
between candidates achieving 100%. All we know is that their ability is 
greater than or equal to that required to achieve the maximum score (ceiling 
effect). Analogously, a floor effect occurs if some candidates cannot solve any 
problems. For normal latent responses (conditional on covariates), one-sided 
censoring was introduced by Tobin (1958) and is hence denoted the Tobit. The 
analogous model for double censoring is due to Rosett and Nelson (1975) and 
is often denoted the two-limit probit. 

Censoring should not be confused with truncation. Left-truncation occurs if 
units with a response below a certain threshold are excluded from the sample. 
An example is a clinical trial for treatment of hypertension where baseline 
blood-pressure must exceed a threshold for inclusion into the study. In the 
duration literature, left-truncation due to a delay between becoming at risk 
and being included in a study (so that units with short durations are excluded) 
is often called late entry or delayed entry. Right-truncation occurs if only those 
units are included in the sample whose response falls below some threshold. 
Classical examples from econometrics are negative income tax experiments 
where families with income levels above a certain limit, for instance 1.5 times 
the poverty line, are excluded from the study of earnings. Unlike censoring, 
truncation means that we have no information on the explanatory variables 
of those individuals whose response is beyond the threshold. 

The different coarsening processes leading to different variable types are 
summarized in Table 2.3. The columns specify the type of coarsening and 
the type of threshold. The thresholds can either vary or be constant across 
units and either be known or unknown model parameters. The resulting types 
of variable are given in the rows. Apart from continuous and ordinal variables, 
the variable types are often referred to as limited dependent variables 
in econometrics. 

Table 2.3 Response types and types of coarsening 

Type of           Coarsening: latent response known                     Threshold(s)
variable          exactly   below limit   above limit   in interval     const./vary.   known/unkn.
Continuous          ✓
Grouped                                                     ✓           const.         known
Interval-cens.                                              ✓           vary.          known
Ordinal/dich.                                               ✓           const.         unkn.
Right-cens.         ✓                         ✓                         vary.          known
Left-cens.          ✓           ✓                                       vary.          known
Doubly cens.        ✓           ✓             ✓                         vary.          known


2.4.3 Comparative responses 

In Section 2.3.3 we considered models for polytomous responses. Polytomous 
responses can be construed as comparative in the sense that the realized 
category ‘dominates’ the others. 

This interpretation is particularly apt for first choice or discrete choice data 
where decision makers choose from a set of alternatives (categories). For 
instance, in election studies a central outcome variable is the first choice of a 
voter, say Conservatives, among a set of alternatives (say Labour, Conservatives 
and Liberals). 

Another type of comparative response is pairwise comparisons where the 
responses are the dominant categories in each pair of categories for a unit. 
For instance, Labour could be preferred to Liberal in the first pair, Liberal 
preferred to Conservatives in the second pair, etc. 

A permutation of categories is also a comparative response. In the decision 
context permutations can be interpreted as rankings, where alternatives are 
ordered according to preference. Political parties may for instance be ranked, 
say Liberals preferred to Labour preferred to Conservatives. In contrast to 
pairwise comparison data, the pairwise comparisons implied by ranking data 
are necessarily transitive (Liberals preferred to Conservatives follows from 
Liberals preferred to Labour and Labour preferred to Conservatives). Comparative 
responses are nominal in the sense that the categories do not possess 
an inherent ordering shared by all units as is assumed for ordinal variables. 
We find it useful to use the decision terminology for comparative responses 
even when decisions are not actually involved. 

Comparative responses can be modeled by assuming that each unit assigns a 
utility $u_i^s$ to each alternative $a_s$. The term utility should be broadly construed 
as popularity or attractiveness of alternatives. For polytomous responses, it is 
assumed that the alternative with the greatest utility is chosen, i.e. 

$$y_i = a_s \quad \text{if} \quad u_i^s - u_i^t > 0 \quad \forall t,\ t \ne s.$$

If we model the utilities as 

$$u_i^s = \nu_i^s + \epsilon_i^s,$$

where the linear predictors $\nu_i^s$ can take on different values for different alternatives 
and the $\epsilon_i^s$ are independently Gumbel (extreme value Type I) distributed, 

$$\Pr(\epsilon_i^s \le \tau) = \exp[-\exp(-\tau)], \qquad (2.17)$$

it can be shown (McFadden, 1973; Yellott, 1977) that the probability of a 
particular choice $a_s$ is 

$$\Pr(y_i = a_s) = \frac{\exp(\nu_i^s)}{\sum_{t=1}^{S} \exp(\nu_i^t)}. \qquad (2.18)$$

This is the multinomial logit model introduced in Section 2.3.3. 
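
The link between Gumbel distributed utilities and the multinomial logit can be made concrete by simulation. This is a sketch only: the linear predictors, seed and sample size are hypothetical, and the Gumbel draws are obtained by inverting the distribution function (2.17).

```python
import math
import random

def softmax(nu):
    """Multinomial logit probabilities for linear predictors nu."""
    m = max(nu)
    w = [math.exp(v - m) for v in nu]
    total = sum(w)
    return [x / total for x in w]

def draw_gumbel(rng):
    """Invert (2.17): if U is uniform on (0, 1), -ln(-ln(U)) is Gumbel distributed."""
    u = rng.random()
    while u == 0.0:          # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))

rng = random.Random(7)       # fixed seed so the sketch is reproducible
nu = [0.2, 1.0, -0.5]        # hypothetical linear predictors for S = 3 alternatives
n = 100_000

counts = [0] * len(nu)
for _ in range(n):
    utilities = [v + draw_gumbel(rng) for v in nu]
    counts[utilities.index(max(utilities))] += 1   # alternative with greatest utility

shares = [c / n for c in counts]
logit_probs = softmax(nu)
```

The simulated choice shares in `shares` match `logit_probs` up to simulation error.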

The multinomial probit model (e.g. Daganzo, 1979) instead assumes that the 
vector containing the $\epsilon_i^s$ has a multivariate normal distribution with variances 
$\omega_s^2$ and covariances $\omega_{ss'}$. Consider for simplicity the case of 3 alternatives, 
$S = 3$, and define $\nu_i^{sk} = \nu_i^s - \nu_i^k$ and $\epsilon_i^{sk} = \epsilon_i^k - \epsilon_i^s$. The probability of choosing 
the first alternative $a_1$ then becomes 

$$\Pr(y_i = a_1) = \int_{-\infty}^{\nu_i^{12}} \int_{-\infty}^{\nu_i^{13}} \varphi(\epsilon_i^{12}, \epsilon_i^{13}) \, d\epsilon_i^{13} \, d\epsilon_i^{12},$$

where $\varphi(\epsilon_i^{12}, \epsilon_i^{13})$ is bivariate normal with expectation vector zero and covariance 
matrix 

$$\begin{bmatrix} \omega_1^2 + \omega_2^2 - 2\omega_{12} & \omega_1^2 - \omega_{13} - \omega_{12} + \omega_{23} \\ \omega_1^2 - \omega_{13} - \omega_{12} + \omega_{23} & \omega_1^2 + \omega_3^2 - 2\omega_{13} \end{bmatrix}.$$

Note that the choice probabilities cannot be expressed in closed form and 
require integration over $S-1$ dimensions. On the other hand, the multinomial 
probit allows the utilities to be dependent in contrast to the multinomial logit 
model. It thus relaxes the so-called independence from irrelevant alternatives 
(IIA) property of the latter model (see Section 13.2). 

For pairwise comparisons, let $y_{ist} = 1$ if unit $i$ prefers alternative $s$ to 
alternative $t$ and 0 otherwise. Under the Gumbel specification the corresponding 
probability becomes 

$$\Pr(y_{ist} = 1) = \frac{\exp(\nu_i^s - \nu_i^t)}{1 + \exp(\nu_i^s - \nu_i^t)},$$

where $\nu_i^1$ is typically set to zero to ensure identification. This model is often 
called the Bradley-Terry-Luce model after Bradley and Terry (1952) and 
Luce (1959). The joint probability of a set of pairwise comparisons for unit 
$i$ is often assumed to be the product of these probabilities. This unrealistic 
independence assumption may be relaxed by introducing latent variables. 

Turning to rankings, let $r_i^\ell$ be the alternative given rank $\ell$ among $S$ alternatives 
and let $R_i \equiv (r_i^1, r_i^2, \ldots, r_i^S)$ be the ranking of unit $i$. The probability 
of a ranking can be construed in terms of a utility ordering and expressed in 
terms of $S-1$ binary utility comparisons, where the utility of the alternative 
ranked first is larger than the utility of that ranked second, which is larger 
than that ranked third and so on. 

For Gumbel distributed random utilities this leads to the logistic model for 
rankings (e.g. Luce, 1959; Plackett, 1975) 

$$\Pr(R_i) = \prod_{\ell=1}^{S-1} \frac{\exp(\nu_i^{r_i^\ell})}{\sum_{m=\ell}^{S} \exp(\nu_i^{r_i^m})}. \qquad (2.19)$$

The model is often denoted the exploded logit (Chapman and Staelin, 1982) 
since the ranking probability is written as a product of first choice probabilities 
for successively remaining alternatives. That such an explosion results was 
proven by Luce and Suppes (1965) and Beggs et al. (1981). The latent response 
perspective reveals that the exploded logit can be derived without making 
the behavioral assumption that the choice process is sequential. Importantly, 
an analogous explosion is not obtained under normally distributed utilities. 
The Gumbel model is not reversible, that is, successive choices starting with 
the worst alternative would lead to a different ranking probability. Another 
essential feature of the model is independence from irrelevant alternatives. 

Different alternative sets for different units, for instance different eligible 
parties in different constituencies, can simply be handled by substituting $S_i$ 
for $S$ in the above formulae. Partial rankings result when unit $i$ only ranks a 
subset of the full set of alternatives, for example when experimental designs 
are used in presenting specific subsets of alternatives to different units (e.g. 
Durbin, 1951; Bockenholt, 1992). Such designs are easily handled by letting the 
alternative sets vary over units. Another kind of partial ranking is top-rankings 
where not all alternatives are ranked but only the subset of the $P_i < S_i$ most 
preferred alternatives. The probability of a top-ranking is simply the product 
of the first $P_i$ terms in equation (2.19). Note that the first choice probability 
is obtained as the special case of the ranking probability when $P_i = 1$ for all $i$. 
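
The exploded logit and its top-ranking special case can be sketched as a product of first choice probabilities among the successively remaining alternatives. The function below is an illustration with hypothetical names and values. A ranking is passed as a sequence of alternative indices ordered from most to least preferred, and for a top-ranking the order of the unranked alternatives at the end of the sequence does not matter.

```python
import math
from itertools import permutations

def ranking_prob(nu, ranking, top=None):
    """Probability of a ranking under the exploded logit.
    nu[a] is the linear predictor of alternative a; ranking lists all alternatives,
    most preferred first. If top is given, only the first `top` terms are multiplied,
    which gives the probability of a top-ranking."""
    n_terms = len(ranking) - 1 if top is None else top
    prob = 1.0
    for l in range(n_terms):
        remaining = ranking[l:]                          # alternatives not yet chosen
        denom = sum(math.exp(nu[a]) for a in remaining)
        prob *= math.exp(nu[ranking[l]]) / denom         # first choice among the remaining
    return prob

nu = [0.4, -0.3, 1.1]    # hypothetical linear predictors for S = 3 alternatives

total = sum(ranking_prob(nu, r) for r in permutations(range(3)))
full = ranking_prob(nu, (2, 0, 1))
first_choice = ranking_prob(nu, (2, 0, 1), top=1)
```

Here `total` sums the probabilities of all six possible rankings, and `first_choice` coincides with the multinomial logit probability of alternative 2.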

Two or more alternatives are said to be tied if they are given the same 
rank. Although the probability of tied rankings is theoretically zero since the 
utilities are continuous, ties are often observed in practice. As we will see in 
Section 2.5, equation (2.19) has the same form as the partial likelihood of a 
stratum in Cox regression for survival analysis. Exploiting this duality, we can 
utilize methods for handling ties previously suggested in the survival literature 
(see Section 2.5). 

Assembling comparative response data is the natural approach when studying 
choice behavior. Comparative responses can also fruitfully replace rating 
or thermometer scales in studying the popularity of objects. Use of scales 
would invoke the unrealistic assumption that individuals use the scale in the 
same way (e.g. Brady, 1989, 1990). However, some subjects tend to use the 
high end of the scale whereas others use the low end. In addition there could 
be differences in the range of ratings used. Alwin and Krosnick (1985) discuss 
pros and cons of ranking and rating designs for the measurement of values. 


2.5 Modeling durations or survival 

The response of interest may be time to some event. In medicine, the archetypal 
example is survival time from the onset of a condition or treatment to 
death. In studies of the reliability of products or components, for instance 
light bulbs, lifetime to failure is often investigated. Instead of using such 
application specific terms, economists usually refer to durations between events. 
Generally, we will adhere to this terminology but occasionally we lapse by 
referring to survival or failure times. 

There are some important distinguishing features of duration data. Durations 
are always nonnegative and some durations are typically not known 
because the event has not occurred before the end of the observation period 
(right-censoring). Furthermore, the values of covariates may change as time 
elapses and the effect of covariates may change over time. These features imply 
that one cannot simply apply standard models for continuous responses. 

Another common phenomenon in survival analysis is left-truncation or delayed 
entry where units that have already experienced the event of interest 
(e.g. death) when observation begins are excluded from the study (see e.g. 
Keiding, 1992). Note that this phenomenon is called ‘stock sampling with follow 
up’ in econometrics since only those in the ‘alive state’ at a given time 
are sampled (e.g. Lancaster, 1990). Left-censoring and right-truncation are less 
common. Left-censoring occurs if the event is only known to have occurred before 
a given time-point, for instance when observation begins. Right-truncation 
can occur in retrospective studies, for example when studying the incubation 
period for AIDS in patients who have already developed the disease. 

In this section we assume that censoring and truncation processes are ignorable 
in the sense that the probability of being censored or truncated is 
independent of the risk of experiencing the event given the covariates. We 
will also confine the discussion to so-called absorbing events or states where 
the unit can only experience the event once. Treatment of multiple events, 
such as recurring headaches, is deferred to Chapter 12 since the dependence 
among events for the same unit must be accommodated. For simplicity, we 
only consider single absorbing events and will not explicitly treat competing 
risk models where there are several absorbing states. 

Duration models are usually not defined as generalized linear models. However, 
it turns out that generalized linear models can often be adapted to yield 
likelihoods that are proportional to those of duration models. 

Durations are either considered in continuous or discrete time, and these 
cases will be discussed in the two subsequent sections. 

2.5.1 Continuous time durations 

Let $T_i$ be a random variable representing the duration for a unit $i$ from becoming 
at risk until it either experiences the event or is censored. The realized 
duration is denoted $t_i$, with a corresponding indicator variable $\delta_i$ taking the 
value 1 if the event is experienced by the unit and 0 if the duration is censored. 

The density function for unit $i$ is denoted $f_i(t)$ and the cumulative distribution 
function becomes 

$$F_i(t) = \int_0^t f_i(u)\,du.$$

The survival function, the probability that duration exceeds $t$, is then defined 
as 

$$S_i(t) \equiv 1 - F_i(t).$$

The hazard, sometimes also called the incidence rate or instantaneous risk, is 
defined as 

$$h_i(t) \equiv \lim_{\Delta \to 0} \frac{\Pr(t \le T_i < t + \Delta \mid T_i \ge t)}{\Delta}.$$

Somewhat loosely, this is the ‘risk’ of an event at time $t$ for unit $i$ given that 
the event has not yet occurred and that the unit is therefore still ‘at risk’. 

It follows from these definitions that 

$$h_i(t) = \frac{f_i(t)}{S_i(t)} = -\frac{d \ln S_i(t)}{dt}. \qquad (2.20)$$

Since $S_i(0) = 1$, it also follows that 

$$S_i(t) = \exp\left[-\int_0^t h_i(u)\,du\right].$$

Defining the integrated hazard or cumulative hazard as 

$$H_i(t) = \int_0^t h_i(u)\,du,$$

unit $i$’s contribution to the likelihood is 

$$l_i = S_i(t_i)\,h_i(t_i)^{\delta_i} = \exp[-H_i(t_i)]\,h_i(t_i)^{\delta_i}. \qquad (2.21)$$
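
The relations between hazard, cumulative hazard, survival function and likelihood contribution can be verified for a concrete distribution. The sketch below uses a Weibull duration as an assumed example, with hypothetical shape and scale values, and checks the cumulative hazard by simple numerical integration.

```python
import math

def weibull_hazard(t, shape, scale):
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_cum_hazard(t, shape, scale):
    return (t / scale) ** shape

def likelihood_contribution(t, delta, shape, scale):
    """exp[-H(t)] * h(t)**delta: survival times the hazard raised to the event indicator."""
    return math.exp(-weibull_cum_hazard(t, shape, scale)) * weibull_hazard(t, shape, scale) ** delta

shape, scale = 1.5, 2.0    # hypothetical Weibull parameters
t = 1.3                    # hypothetical realized duration

# Midpoint-rule approximation to the integral of the hazard from 0 to t
steps = 20_000
width = t / steps
H_numeric = sum(weibull_hazard((k + 0.5) * width, shape, scale) for k in range(steps)) * width

survival = math.exp(-weibull_cum_hazard(t, shape, scale))
density = weibull_hazard(t, shape, scale) * survival    # f(t) = h(t) S(t)
```

For an observed event (`delta = 1`) the contribution equals the density, and for a censored duration (`delta = 0`) it equals the survival function.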


Accelerated failure time models 

One approach to specifying a duration model is as 

$$\ln T_i = \beta_0 + \nu_i + \epsilon_i,$$

where $\nu_i$ does not include a constant, or alternatively as 

$$T_i = \exp(\beta_0)\exp(\nu_i)\exp(\epsilon_i).$$

We see that the covariates act multiplicatively directly on the time scale, thus 
accelerating or decelerating durations. The survival function of such accelerated 
failure time models can therefore be written as 

$$S_i(t) = S_0(\exp(-\nu_i)\,t),$$

where $S_0(t)$ is the baseline survival function for $\nu_i = 0$. Examples of accelerated 
failure time models include the log-normal duration model where $\epsilon_i$ is normally 
distributed, the log-logistic duration model with a logistic $\epsilon_i$ and the Weibull 
duration model if $\epsilon_i$ has a Gumbel distribution, see (2.17) in the previous 
section. 

Proportional hazards models 

The hazards can be modeled as 

$$h_i(t) = h^0(t)\exp(\nu_i), \qquad (2.22)$$

where $h^0(t)$ is the ‘baseline’ hazard, the hazard when all covariates are zero 
(the linear predictor does not include a constant). Considering two units $i$ and 
$i'$, we obtain 

$$\frac{h_i(t)}{h_{i'}(t)} = \exp(\nu_i - \nu_{i'}), \qquad (2.23)$$

so the hazard functions of any two units are proportional over time. 

The Weibull duration model introduced above is unique in that it possesses 
both the proportional hazards and accelerated failure time properties (e.g. 
Cox and Oakes, 1984). The exponential duration model is the special case of 
the Weibull model where the baseline hazard is constant, $h^0(t) = h^0$. This 
property is relaxed in the piecewise exponential duration model where the 
baseline hazard function is assumed to be piecewise constant over intervals $s$, 
with $h^0(t) = h_s$ for $\tau_{s-1} \le t < \tau_s$, $s = 1, 2, \ldots, S$. 

Interestingly, it turns out that the likelihood of the piecewise exponential 
duration model is proportional to that of a Poisson model when the data are 
expanded appropriately (Holford, 1980). Let $\theta_i = \exp(\nu_i)$ so that unit $i$ has 
hazard $h_s\theta_i$. For a unit that was censored or failed in the $s$th interval its 
contribution to the likelihood becomes (see (2.21)) 

$$l_i = \underbrace{(h_s\theta_i)^{\delta_i}}_{h_i(t_i)^{\delta_i}} \exp\Big(-\underbrace{\sum_{r=1}^{s} h_r\theta_i d_{ir}}_{H_i(t_i)}\Big), \qquad (2.24)$$

where $d_{ir}$ is the time unit $i$ spent in interval $r$, $d_{ir} = \min(t_i, \tau_r) - \tau_{r-1}$. This 
can be rewritten as 

$$l_i = \prod_{r=1}^{s} (h_r\theta_i)^{y_{ir}} \exp(-h_r\theta_i d_{ir}), \qquad (2.25)$$

where $y_{ir} = 0$ for $r < s$ and $y_{is} = \delta_i$. This is proportional to the contribution 
to the likelihood of $s$ independent Poisson variates $y_{ir}$ with means $h_r\theta_i d_{ir}$, 
see (2.3). See also Clayton (1988). 

We represent each unit by a number of observations (or ‘risk sets’) equal 
to the number of time intervals preceding or including that unit’s failure (or 
censoring) time as shown in the upper panel of Display 2.1 on page 43. The 
model may then be fitted by Poisson regression with a log link, using $y_{ir}$ as 
response variable, $\ln(d_{ir})$ as an offset (a covariate with regression coefficient 
set to 1) and dummies for the time intervals as explanatory variables to allow 
for different piecewise constant hazards $h_r$. Explicitly, 

$$y_{ir} \sim \text{Poisson}(\mu_{ir}),$$

where 

$$\ln(\mu_{ir}) = \nu_{ir}$$

and 

$$\nu_{ir} = \ln(d_{ir}) + \ln(h_r) + \mathbf{x}_i'\boldsymbol{\beta}.$$

Therefore, one approach to survival modeling is to divide the follow-up 
period into intervals over which the hazard can be assumed to be piecewise 
constant and use Poisson regression. Breslow and Day (1987, p.137) show how 
Poisson regression can be implemented for identity and power links. Assuming 
a piecewise linear log hazard corresponds to a piecewise Gompertz distribution 
(e.g. Lillard, 1993). 
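
The data expansion behind the Poisson formulation of the piecewise exponential model can be sketched as follows. The function and variable names are hypothetical, the interval limits and hazards are made-up values, and the sketch assumes that the duration does not exceed the last interval limit and that a duration falling exactly on a limit is assigned to the interval ending there.

```python
import math

def expand(t, delta, cuts):
    """Split one duration into records (interval r, exposure d_ir, response y_ir).
    cuts = [tau_0, tau_1, ..., tau_S] are the interval limits with tau_0 = 0."""
    rows = []
    for r in range(1, len(cuts)):
        if t <= cuts[r - 1]:
            break
        d = min(t, cuts[r]) - cuts[r - 1]        # time spent in interval r
        last = t <= cuts[r]                      # interval containing the failure or censoring time
        rows.append((r, d, delta if last else 0))
    return rows

def loglik_piecewise_exp(t, delta, cuts, h, theta):
    """Log of the piecewise exponential contribution; h[r-1] is the hazard in interval r."""
    rows = expand(t, delta, cuts)
    s = rows[-1][0]
    return delta * math.log(h[s - 1] * theta) - sum(h[r - 1] * theta * d for r, d, _ in rows)

def loglik_poisson(t, delta, cuts, h, theta):
    """Sum of Poisson log-likelihoods of the expanded records with means h_r * theta * d_ir."""
    total = 0.0
    for r, d, y in expand(t, delta, cuts):
        mu = h[r - 1] * theta * d
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

cuts = [0.0, 1.0, 2.0, 4.0]          # hypothetical interval limits
rows = expand(2.5, 1, cuts)          # a unit failing at t = 2.5

diff_a = loglik_poisson(2.5, 1, cuts, [0.2, 0.5, 0.9], 1.3) - loglik_piecewise_exp(2.5, 1, cuts, [0.2, 0.5, 0.9], 1.3)
diff_b = loglik_poisson(2.5, 1, cuts, [1.1, 0.3, 0.4], 0.6) - loglik_piecewise_exp(2.5, 1, cuts, [1.1, 0.3, 0.4], 0.6)
```

The two differences coincide because the log-likelihoods differ only by a term that depends on the exposure in the last interval and not on the hazards or the linear predictor, which is what proportionality of the likelihoods means here.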

Another approach is to define as many intervals as there are unique failure 
times with each interval starting at (just after) a unique failure time and 
ending at (just after) the next largest unique failure time (Holford, 1976; 
Whitehead, 1980; Laird and Olivier, 1981). This corresponds to the famous 
Cox proportional hazards model since a ‘saturated’ or nonparametric model 
with a separate constant for each risk set is used for the baseline hazard. 

The data expansion necessary to estimate the Cox model using Poisson 
regression with log link is shown for an artificial dataset in the lower panel of 
Display 2.1. It can be shown that the corresponding likelihood is proportional 
to the partial likelihood of Cox regression 

$$l^P = \prod_r \frac{\exp(\nu_{(r)})}{\sum_{i \in R(t_{(r)})} \exp(\nu_i)},$$

where $\nu_{(r)}$ is the linear predictor for the unit that failed at the $r$th ordered 
failure time and $R(t_{(r)})$ is the risk set for the $r$th failure time (see Display 2.1 
where the unit that fails and contributes to the numerator is enclosed 
in a box). This partial likelihood can be derived by eliminating the baseline 
hazard using a profile likelihood approach (Johansen, 1983). Note that the 
partial likelihood is equivalent to the conditional likelihood of logistic regression 
with pair-specific intercepts that is widely used for matched case-control 
studies (e.g. Breslow and Day, 1980). 
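
The partial likelihood can be evaluated directly from durations, event indicators and linear predictors. The following sketch assumes there are no tied failure times, and the small dataset is invented for illustration.

```python
import math

def cox_partial_loglik(times, events, nu):
    """Log partial likelihood for untied failure times.
    times[i] is the duration, events[i] is 1 for failure and 0 for censoring,
    nu[i] is the linear predictor of unit i."""
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # risk set: units still under observation at the failure time of unit i
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            loglik += nu[i] - math.log(sum(math.exp(nu[j]) for j in risk))
    return loglik

times = [2.0, 3.5, 1.2, 4.1]     # hypothetical durations
events = [1, 0, 1, 1]            # the second unit is censored
nu = [0.3, -0.2, 0.8, 0.0]       # hypothetical linear predictors

value = cox_partial_loglik(times, events, nu)
```

Censored units contribute no factor of their own but remain in the risk sets of all failures that occur before their censoring time.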

Instead of allowing the baseline hazard to take on a different value for 
each failure as in Cox regression, the baseline hazard could be modeled as a 
smooth function of time using the Poisson regression approach. A polynomial 
in time could be used for the log hazard, or alternatively splines or fractional 
polynomials. The estimated function can be interpreted as the baseline hazard 
function only if $\ln(d_{ir})$ is included as an offset, a point ignored in Goldstein 
(2003, Chapter 10). Note that although the hazard is modeled as a smooth 
function of the failure times $t_{(r)}$, it is constant between adjacent failure times. 

Display 2.1 Continuous durations: original and expanded data (upper panel: piecewise exponential model; lower panel: Cox model) 

Tied or identical durations often arise in practice although inconsistent with 
continuous time duration models. The appropriate way of handling ties is to 
sum the likelihood contributions for all possible permutations of the tied durations 
(e.g. Kalbfleisch and Prentice, 2002), but this can become very involved. 
Thus, approximate methods have been suggested, the most commonly used 
being the Peto-Breslow method (Peto, 1972; Breslow, 1974). This method 
amounts to assuming that all tied units are still at risk when any of the units 
fail. The Peto-Breslow method appears to work well as long as the number of 
ties is moderate (Farewell and Prentice, 1980). A better approximation has 
been proposed by Efron (1977), where the contribution of the tied units in the 
denominator is successively downweighted to reflect the decreasing risk sets. 
When there are many ties, it may be more appropriate to treat durations as 
discrete rather than continuous. 
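
The difference between the approaches to ties can be seen for a single failure time with tied units. The sketch below is an illustration under stated assumptions: the linear predictors are hypothetical, and the permutation-based term is divided by the number of orderings so that it is on the same scale as the two approximations.

```python
import math
from itertools import permutations

def tied_term_breslow(nu_tied, nu_risk):
    """Peto-Breslow: every tied failure keeps the full risk set in the denominator."""
    denom = sum(math.exp(v) for v in nu_risk)
    term = 1.0
    for v in nu_tied:
        term *= math.exp(v) / denom
    return term

def tied_term_efron(nu_tied, nu_risk):
    """Efron: the tied units' part of the denominator is downweighted step by step."""
    d = len(nu_tied)
    risk_sum = sum(math.exp(v) for v in nu_risk)
    tied_sum = sum(math.exp(v) for v in nu_tied)
    term = math.exp(sum(nu_tied))
    for k in range(d):
        term /= risk_sum - (k / d) * tied_sum
    return term

def tied_term_permutations(nu_tied, nu_risk):
    """Average over all orderings of the tied failures, removing each unit from
    the risk set once it has failed."""
    risk_sum = sum(math.exp(v) for v in nu_risk)
    orders = list(permutations(nu_tied))
    total = 0.0
    for order in orders:
        remaining, term = risk_sum, 1.0
        for v in order:
            term *= math.exp(v) / remaining
            remaining -= math.exp(v)
        total += term
    return total / len(orders)

nu_tied = [0.5, -0.1]                      # two units failing at the same time
nu_risk = nu_tied + [0.2, 0.0, -0.4]       # risk set includes the tied units

breslow = tied_term_breslow(nu_tied, nu_risk)
efron = tied_term_efron(nu_tied, nu_risk)
by_permutations = tied_term_permutations(nu_tied, nu_risk)
```

In this example the Efron term lies between the Peto-Breslow term and the permutation-based term, and is much closer to the latter.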


2.5.2 Discrete time durations 

Discrete time duration data most commonly arise from interval-censoring of 
processes in continuous time. Another source is discrete time processes where 
events can only occur at discrete time points, for instance durations of party 
loyalty in terms of number of elections. In either case, it can be useful to model 
discrete durations as interval-censored. Let $\tau_s$ be the censoring limits so that 
all we know is that 

$$\tau_{s-1} \le T_i < \tau_s.$$

In addition, the discrete survival time may be right-censored, 

$$\tau_{s-1} \le T_i.$$


Proportional odds model 


The proportional odds model introduced in Section 2.3.4 for ordinal responses
can also be used for modeling discrete time durations. The probability that
$T_i$ is less than $\tau_s$ becomes

$$\Pr(T_i < \tau_s) = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta} + \kappa_s)}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta} + \kappa_s)},$$

and the probability that the survival time lies in the $s$th interval,
$\tau_{s-1} \le T_i < \tau_s$, is

$$\Pr(\tau_{s-1} \le T_i < \tau_s) = \Pr(T_i < \tau_s) - \Pr(T_i < \tau_{s-1}),$$

with $\Pr(T_i < \tau_0) = 0$. It should be noted that the present $\boldsymbol{\beta}$ have opposite
signs to the $\boldsymbol{\beta}$ in Section 2.3.4 so that large coefficients imply increased risk.
In the absence of right-censoring, this is the likelihood contribution of all
observations whose survival times lie in the $s$th interval. For observations that
are censored after the $s$th interval, the likelihood contribution is
$\Pr(T_i \ge \tau_s) = 1 - \Pr(T_i < \tau_s)$. The proportional odds model has been used for discrete
survival time data by Bennett (1983), Han and Hausman (1990), Ezzet and
Whitehead (1991) and Hedeker et al. (2000).
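
The interval probabilities above can be computed directly from the cumulative probabilities. The following is a minimal sketch, assuming a scalar linear predictor `xb` and a list of increasing thresholds `kappas`; the function names and numerical values are illustrative only and do not come from the text.

```python
import math


def cumulative_prob(xb, kappa_s):
    """Pr(T_i < tau_s) = exp(xb + kappa_s) / (1 + exp(xb + kappa_s))."""
    return 1.0 / (1.0 + math.exp(-(xb + kappa_s)))


def interval_probs(xb, kappas):
    """Return Pr(tau_{s-1} <= T_i < tau_s) for s = 1..S and Pr(T_i >= tau_S)."""
    # Pr(T_i < tau_0) = 0 by definition
    cum = [0.0] + [cumulative_prob(xb, k) for k in kappas]
    probs = [cum[s] - cum[s - 1] for s in range(1, len(cum))]
    # Likelihood contribution of a unit censored after the last interval
    survivor = 1.0 - cum[-1]
    return probs, survivor


kappas = [-1.0, 0.0, 1.5]  # hypothetical thresholds
probs, survivor = interval_probs(xb=0.4, kappas=kappas)
print(probs, survivor)
```

Increasing `xb` shifts probability mass toward the earlier intervals, which is the sense in which large coefficients imply increased risk.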

The proportional odds model can also be given a latent response interpretation
(see Section 2.4),

$$y_i^* = -\mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i, \tag{2.26}$$

where $\epsilon_i$ has a logistic distribution. If a standard normal distribution is
assumed for $\epsilon_i$, the ordinal probit model is obtained. The event occurs in the
$s$th interval if $\kappa_{s-1} < y_i^* \le \kappa_s$, i.e.,

$$\Pr(\tau_{s-1} \le T_i < \tau_s) = \Pr(\kappa_{s-1} < y_i^* \le \kappa_s).$$

The latent response $y_i^*$ can therefore be thought of as a monotonic transformation
of $T_i$ so that $y_i^* = \kappa_s$ corresponds to $T_i = \tau_s$. By constraining the threshold
parameters $\kappa_s$ to be equally spaced so that the transformation from $T_i$ to $y_i^*$ is
linear, the appropriateness of a linear regression model for $T_i$ can be assessed.


Models based on the discrete time hazard 


The discrete time hazard for the $s$th interval is defined as the probability that
the event occurs in the $s$th interval given that it has not already occurred,

$$h_i(s) = \Pr(\tau_{s-1} \le T_i < \tau_s \mid T_i \ge \tau_{s-1}) = \frac{\Pr(\tau_{s-1} \le T_i < \tau_s)}{\Pr(T_i \ge \tau_{s-1})}.$$

The likelihood contribution of a unit whose survival time lies in the $s$th interval
is

$$l_i = h_i(s) \prod_{r=1}^{s-1} [1 - h_i(r)] = \prod_{r=1}^{s} h_i(r)^{y_{ir}} [1 - h_i(r)]^{(1 - y_{ir})} \quad \text{with } y_{is} = 1. \tag{2.27}$$

Here, $y_{ir}$ is an indicator variable that is equal to 1 if the event occurred in the
$r$th interval and equal to 0 otherwise, i.e. $y_{ir} = 0$ when $r < s$ and $y_{ir} = 1$ when
$r = s$. The likelihood contribution of a unit who was censored after the $s$th
interval has the same form,

$$l_i = \prod_{r=1}^{s} [1 - h_i(r)] = \prod_{r=1}^{s} h_i(r)^{y_{ir}} [1 - h_i(r)]^{(1 - y_{ir})} \quad \text{with } y_{is} = 0. \tag{2.28}$$

The likelihood contributions of both censored and noncensored observations
are just the likelihood contributions of $s$ independent binary responses $y_{ir}$,
$r = 1, \ldots, s$, with Bernoulli probabilities $h_i(r)$. Therefore, for a unit that either fails
or is censored in the $s$th interval we expand the data to $s$ records and construct
the indicator variable $y_{ir}$ as shown in Display 2.2 on page 46. Discrete time
survival models can then be written as generalized linear models for binary
responses.
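
The data expansion described above can be sketched in a few lines. This is only an illustration under assumed conventions: each original record is a tuple `(unit_id, s, event)`, where `s` is the interval in which the unit failed (`event = 1`) or was censored (`event = 0`). The function name, the record layout and the example data are hypothetical.

```python
def expand_person_period(records):
    """Expand one record per unit into one record per unit and interval.

    records: iterable of (unit_id, s, event) tuples.
    Returns a list of dicts with keys 'id', 'r' (interval) and 'y' (indicator y_ir).
    """
    rows = []
    for unit_id, s, event in records:
        for r in range(1, s + 1):
            # y_ir is 1 only in the last interval of a unit that failed
            y = 1 if (event == 1 and r == s) else 0
            rows.append({"id": unit_id, "r": r, "y": y})
    return rows


# Hypothetical data: unit 1 fails in interval 3, unit 2 is censored in interval 2
original = [(1, 3, 1), (2, 2, 0)]
expanded = expand_person_period(original)
for row in expanded:
    print(row)
```

The expanded records can then be passed to any routine for binary regression, with the interval `r` entering through interval-specific constants.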

One possibility is to use logistic regression with a separate constant for each
interval,

$$\ln\left[\frac{h_i(r)}{1 - h_i(r)}\right] = \mathbf{x}_i'\boldsymbol{\beta} + \kappa_r. \tag{2.29}$$

This model, proposed by Cox (1972), is often referred to as a proportional odds
model. However, whereas proportionality here applies to the conditional odds
of the event happening in an interval given that it has not already happened,
proportionality in the proportional odds model presented in the previous section
applied to the odds of the event happening in a given interval or earlier.
The above logistic model for discrete time survival data is equivalent to the
continuation ratio logit model introduced in Section 2.3.4 except for the sign
of the linear predictor. Continuation ratio models are useful for sequential
processes in which stages (such as educational attainment levels) cannot be
skipped and interest focuses on the odds of (not) continuing beyond a stage
given that the stage has been reached. See Jenkins (1995) and Singer and
Willett (1993) for introductions to this model.

Display 2.2 Discrete time durations: original and expanded data

If a Cox proportional hazards model is assumed for the unobserved continuous
survival times and the observed discrete survival times are treated as
interval-censored, it can be shown that the likelihood contributions are equal
to those in (2.27) and (2.28) if a complementary log-log link is used for the
discrete time hazard (e.g. Thompson, 1977), i.e.

$$\ln\{-\ln[1 - h_i(r)]\} = \mathbf{x}_i'\boldsymbol{\beta} + \kappa_r. \tag{2.30}$$
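
The step from the proportional hazards model to the complementary log-log link can be sketched as follows. The cumulative baseline hazard $H_0(t)$ and the survivor function $S_i(t)$ are notation introduced here for illustration, and the covariates are assumed to be constant over time.

```latex
\begin{align*}
S_i(t) &= \exp\{-H_0(t)\exp(\mathbf{x}_i'\boldsymbol{\beta})\}, \\
h_i(r) &= 1 - \frac{S_i(\tau_r)}{S_i(\tau_{r-1})}
        = 1 - \exp\{-\exp(\mathbf{x}_i'\boldsymbol{\beta})\,[H_0(\tau_r) - H_0(\tau_{r-1})]\}, \\
\ln\{-\ln[1 - h_i(r)]\} &= \mathbf{x}_i'\boldsymbol{\beta} + \kappa_r,
\qquad \kappa_r \equiv \ln[H_0(\tau_r) - H_0(\tau_{r-1})].
\end{align*}
```

Under these assumptions the interval-specific constants $\kappa_r$ absorb the increments of the cumulative baseline hazard, so the regression coefficients keep their proportional hazards interpretation.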


2.6 Summary and further reading 

We have introduced a wide range of response processes, including continuous, 
dichotomous, grouped, censored, ordinal and comparative responses, as well 
as counts and durations in discrete and continuous time. Most of the processes 
can more or less directly be expressed as generalized linear models, and many 
as latent response models. The models for the response processes will serve 
as building blocks for the more general models introduced later. Furthermore, 
the application chapters 9 to 14 are structured according to type of response 
process. 

Useful books on generalized linear models include McCullagh and Nelder
(1989), Aitkin et al. (1989, 2004) and Fahrmeir and Tutz (2001).
For categorical responses we recommend Long (1997), Agresti (1996, 2002),
Collett (2002) and Andersen (1980). Ordinal responses are discussed in Clogg
and Shihadeh (1994) and Johnson and Albert (1999), counts in Cameron and
Trivedi (1998) and Winkelmann (2003). Maddala (1983) considers categorical
responses as well as limited-dependent responses such as censored and
truncated responses.

The following books treat comparative responses: Marden (1995) considers
rankings, David (1988) pairwise comparisons and Train (1986, 2003) polytomous
responses or discrete choices.

Useful books for modeling durations or ‘survival’ include (approximately in
increasing order of difficulty) Singer and Willett (2003), Allison (1984),
Hosmer and Lemeshow (1999), Collett (2003), Breslow and Day (1987), Klein and
Moeschberger (2003), Therneau and Grambsch (2000), Vermunt (1997), Cox
and Oakes (1984), Kalbfleisch and Prentice (2002), and Andersen et al. (1993).

CHAPTER 3


Classical latent variable models 


3.1 Introduction 

In this chapter we survey classical latent variable models. The point of
describing these models is threefold: first, to give an overview of how latent
variables are traditionally used in various branches of statistics; second, to
familiarize the reader with basic ideas, notation and terminology used later in
the book; third, to start unifying different approaches in preparation for the
general model framework discussed in Chapter 4.

The latent variable models considered here include 

• Multilevel regression models 

• Factor models 

• Item response models 

• Structural equation models 

• Latent class models 

• Models for longitudinal data 

It should be evident that latent variable models are being used for various 
purposes in different academic disciplines, a point that will be repeatedly 
illustrated in the application chapters. Moreover, some of the models are used 
across subject areas whereas others are little known outside specific disciplines. 

3.2 Multilevel regression models 

Multilevel data arise when units are nested in clusters. Examples include
students in classes, patients in hospitals and left and right eyes of individuals. We
refer to the elementary units (e.g. students, patients or eyes) as level-1 units
and the clusters (e.g. classes, hospitals or heads) as level-2 units. If the
clusters are themselves clustered into ‘higher level’ (super)clusters, for example
if students are nested in classes and classes nested in schools, the data have
a three-level structure. Important developments in multilevel modeling were
initiated in the school setting; see for instance Aitkin et al. (1981) and Aitkin
and Longford (1986).

The units belonging to the same cluster share the same cluster-specific
influences. For example, students in the same class are taught by the same
teacher and students in the same school have parents who send them to that
school (by choice or due to place of residence). However, we cannot expect
to include all cluster-specific influences as covariates in an analysis. This is
because we often have limited knowledge regarding relevant covariates and
our dataset may furthermore lack information on these covariates. As a result
there is cluster-level unobserved heterogeneity leading to dependence between
responses for units in the same cluster after conditioning on covariates. This
was illustrated in Figure 1.4 on page 10.

© 2004 by Chapman & Hall/CRC

In multilevel regression, unobserved heterogeneity is modeled by including
random effects in a multiple regression model. There are two types of random
effect: random intercepts and random coefficients. Whereas random intercepts
represent unobserved heterogeneity in the overall response, random coefficients
represent unobserved heterogeneity in the effects of explanatory variables on
the response variable.

3.2.1 Two-level random intercept models 

Let level-2 units, say schools, be indexed $j = 1, \ldots, J$, and level-1 units, say
students, be indexed $i = 1, \ldots, n_j$. Consider a two-level random intercept model
with a single student-specific covariate $x_{ij}$,

$$y_{ij} = \eta_{0j} + \beta_1 x_{ij} + \epsilon_{ij}, \tag{3.1}$$

where $\eta_{0j}$ are school-specific intercepts, $\beta_1$ is a regression coefficient and $\epsilon_{ij}$
are level-1 residual terms. The $\eta_{0j}$ are modeled as

$$\eta_{0j} = \gamma_{00} + \zeta_{0j}, \tag{3.2}$$

where $\gamma_{00}$ is the mean intercept and $\zeta_{0j}$ is the deviation of the school-specific
intercept $\eta_{0j}$ from the mean. Defining $\theta \equiv \mathrm{Var}(\epsilon_{ij})$ and $\psi \equiv \mathrm{Var}(\zeta_{0j})$, it is
typically assumed that the clusters $j$ are independent and

$$\begin{aligned}
\epsilon_{ij} \mid x_{ij} &\sim \mathrm{N}(0, \theta), \\
\mathrm{Cov}(\epsilon_{ij}, \epsilon_{i'j}) &= 0, \quad i \ne i', \\
\zeta_{0j} \mid x_{ij} &\sim \mathrm{N}(0, \psi), \\
\mathrm{Cov}(\zeta_{0j}, \epsilon_{ij}) &= 0.
\end{aligned}$$

Note that the first assumption implies that the $\epsilon_{ij}$ are uncorrelated with the
covariate $x_{ij}$ and the third assumption that the $\zeta_{0j}$ are uncorrelated with
$x_{ij}$. Somewhat carelessly, the conditioning on $x_{ij}$ is typically omitted when
these assumptions are stated. We will adhere to this convention from now on,
keeping in mind that expectations of random terms should be interpreted as
conditional on covariates.

The reduced form of the model is obtained by substituting the level-2
model (3.2) for $\eta_{0j}$ into the level-1 model (3.1) for $y_{ij}$, giving

$$y_{ij} = \gamma_{00} + \zeta_{0j} + \beta_1 x_{ij} + \epsilon_{ij}. \tag{3.3}$$

This model resembles a conventional analysis of covariance (ANCOVA) model
but the $\zeta_{0j}$ are random effects of the ‘factor’ school instead of fixed effects of
school. In the ANOVA terminology, school would be referred to as a random
factor (not to be confused with ‘common factors’ to be discussed in
Section 3.3.2). By assuming a distribution for the intercepts, the school effect
is captured by a single parameter, the variance $\psi$, instead of a separate
parameter for each school (except one). Treating school as a random factor is
appropriate if we wish to make inferences regarding the population of schools
rather than the specific schools in the dataset. We return to the distinction
between fixed and random effects on pages 81 to 84 of this chapter and in
Sections 6.1 and 6.10 of the estimation chapter. Interactions between the random
factor school and fixed factors produce random coefficient models, discussed
in Section 3.2.2.

The random intercept model is an example of a mixed effects model or linear
mixed model since it includes both fixed effects $\gamma_{00}$ and $\beta_1$ and a random effect
$\zeta_{0j}$ in addition to the residual $\epsilon_{ij}$. We can partition the reduced form into a
fixed and random part as follows

$$y_{ij} = \underbrace{\gamma_{00} + \beta_1 x_{ij}}_{\text{fixed part}} + \underbrace{\zeta_{0j} + \epsilon_{ij}}_{\text{random part}},$$

where the sum of the terms in the random part can be thought of as a total
residual $\xi_{ij} = \zeta_{0j} + \epsilon_{ij}$. Due to this composition of the error, the model is
sometimes called an error-component model. The variance of this total residual, or
equivalently the conditional variance of the responses given $x_{ij}$, is

$$\mathrm{Var}(y_{ij} \mid x_{ij}) = \mathrm{Var}(\zeta_{0j} + \epsilon_{ij}) = \psi + \theta.$$

This variance is composed of two variance components, the between-school
variance $\psi$ and the within-school variance $\theta$. If $x_{ij}$ is omitted in (3.1), the
model is therefore called a variance components model.

Any two responses $y_{ij}$ and $y_{i'j}$ in the same level-2 unit are conditionally
independent given the random intercept $\zeta_{0j}$ and covariate values $x_{ij}$ and $x_{i'j}$,

$$\mathrm{Cov}(y_{ij}, y_{i'j} \mid \zeta_{0j}, x_{ij}, x_{i'j}) = \mathrm{Cov}(\epsilon_{ij}, \epsilon_{i'j}) = 0, \quad i \ne i'.$$

However, because the random intercepts are shared among students in the
same school, they induce dependence between responses from students within
the same school after conditioning on the covariate. This dependence is often
expressed in terms of the correlation within a cluster, the so-called intraclass
correlation. The intraclass correlation $\rho$ becomes

$$\rho \equiv \mathrm{Cor}(y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}) = \mathrm{Cor}(\zeta_{0j} + \epsilon_{ij}, \zeta_{0j} + \epsilon_{i'j}) = \frac{\psi}{\psi + \theta}.$$

The intraclass correlation thus represents the proportion of the total residual
variance $\psi + \theta$ that is due to the between-school residual variance $\psi$.
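
The intraclass correlation can be checked by simulation. The sketch below draws total residuals for a balanced design and compares the empirical correlation between two students in the same school with $\psi/(\psi + \theta)$. The sample sizes, variances and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 2000, 5            # hypothetical numbers of schools and students per school
psi, theta = 2.0, 3.0     # between-school and within-school variances

zeta = rng.normal(0.0, np.sqrt(psi), size=(J, 1))    # random intercepts zeta_0j
eps = rng.normal(0.0, np.sqrt(theta), size=(J, n))   # level-1 residuals epsilon_ij
xi = zeta + eps                                      # total residuals

# Empirical correlation between the residuals of two students in the same school
rho_hat = np.corrcoef(xi[:, 0], xi[:, 1])[0, 1]
rho = psi / (psi + theta)                            # theoretical intraclass correlation
print(round(rho, 3), round(rho_hat, 3))
```

With the values assumed here the theoretical value is 0.4, and the empirical correlation should be close to it up to sampling error.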

We can attempt to ‘explain’ the between-school variability by including a
school-level covariate $w_j$, such as teacher to student ratio, in the school-level
model (3.2),

$$\eta_{0j} = \gamma_{00} + \gamma_{01} w_j + \zeta_{0j}, \tag{3.4}$$

where $\gamma_{00}$ and $\gamma_{01}$ are fixed coefficients and $\zeta_{0j}$ now becomes a school-level
residual or disturbance. Substituting this model into (3.1), we obtain the
reduced form

$$y_{ij} = \gamma_{00} + \gamma_{01} w_j + \zeta_{0j} + \beta_1 x_{ij} + \epsilon_{ij}. \tag{3.5}$$


Endogeneity 

One of the assumptions of the random intercept model is that $\mathrm{E}(\zeta_{0j} \mid x_{ij}) = 0$,
which implies that $\mathrm{Cov}(\zeta_{0j}, x_{ij}) = 0$. If the random intercept represents the
effect of missing covariates, this assumption will often be violated since these
missing covariates may well be correlated with the observed covariate. The
observed covariate is in this case called ‘endogenous’; see Engle et al. (1983) for
a discussion of endogeneity and exogeneity.

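We will discuss the situation where the specified model is

$$y_{ij} = \gamma_{00} + \zeta_{0j} + \beta_1 x_{ij} + \epsilon_{ij}, \tag{3.6}$$

whereas the correct model is that in equation (3.5). In other words, we have
omitted a cluster-level covariate $w_j$. If the random intercept represents the
effects of omitted covariates, it should therefore have expectation $\gamma_{01} w_j$ and
can be written as

$$\zeta_{0j} = \gamma_{01} w_j + \zeta_{0j}, \tag{3.7}$$

where, with a slight abuse of notation, $\zeta_{0j}$ on the left-hand side is the random
intercept of the specified model (3.6) and $\zeta_{0j}$ on the right-hand side is the
disturbance of the correct model (3.5).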

Omission of $w_j$ can be problematic if $w_j$ and $x_{ij}$ are dependent. The
dependence can be expressed as the regression

$$w_j = \alpha_0 + \alpha_1 \bar{x}_{\cdot j} + u_j,$$

where $\bar{x}_{\cdot j}$ is the cluster mean of $x_{ij}$. Note that we have regressed $w_j$ on the
cluster mean $\bar{x}_{\cdot j}$ instead of $x_{ij}$ because $x_{ij} = (x_{ij} - \bar{x}_{\cdot j}) + \bar{x}_{\cdot j}$ and the regression
coefficient of $w_j$ on $(x_{ij} - \bar{x}_{\cdot j})$ is zero.

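Substituting for $w_j$ in (3.7), we see that the random intercept depends on
$\bar{x}_{\cdot j}$,

$$\zeta_{0j} = \underbrace{\gamma_{01}\alpha_0}_{\delta_0} + \underbrace{\gamma_{01}\alpha_1}_{\delta_1} \bar{x}_{\cdot j} + \underbrace{\gamma_{01} u_j + \zeta_{0j}}_{\varpi_j} = \delta_0 + \delta_1 \bar{x}_{\cdot j} + \varpi_j,$$

where $\zeta_{0j}$ on the right-hand side is the disturbance of the correct model (3.5).
Substituting for $\zeta_{0j}$ in (3.6), the reduced form model becomes

$$y_{ij} = \gamma_{00} + \delta_0 + \delta_1 \bar{x}_{\cdot j} + \beta_1 x_{ij} + \varpi_j + \epsilon_{ij}. \tag{3.8}$$

We see that by including the cluster mean $\bar{x}_{\cdot j}$ as a separate covariate, the
coefficient of $x_{ij}$ becomes the required parameter $\beta_1$ in the correctly specified
model although we have omitted the cluster-level covariate $w_j$. However,
omitting $\bar{x}_{\cdot j}$ from the model will yield a biased estimate of $\beta_1$ if $\delta_1 = \gamma_{01}\alpha_1 \ne 0$.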

It may be useful to revisit the hypothetical example briefly discussed in
Section 1.4. Here the units $i$ are people and the clusters $j$ countries. The
response $y_{ij}$ is length of life, $x_{ij}$ exposure to red meat and $w_j$ some index of the
country’s standard of living. The country-level random intercept, representing
unexplained variability between different countries’ life expectancies, could
well include the effect of the omitted variable standard of living $w_j$, which in
turn is correlated with the average red meat consumption $\bar{x}_{\cdot j}$. We would
expect $\gamma_{01} > 0$ and $\alpha_1 > 0$ so that $\delta_1 > 0$. This positive effect of the country-mean
$\bar{x}_{\cdot j}$ on the country-specific intercept can be seen in Figure 1.7 on page 13.
Using the misspecified model (not including $\bar{x}_{\cdot j}$ as a separate covariate), the
negative true effect of red meat on life expectancy would therefore be
underestimated due to the country-level positive relationship between red meat and
life expectancy induced by the omitted covariate standard of living.

The model in (3.8) can alternatively be written as

$$y_{ij} = \gamma_{00} + \delta_0 + (\delta_1 + \beta_1)\bar{x}_{\cdot j} + \beta_1 (x_{ij} - \bar{x}_{\cdot j}) + \varpi_j + \epsilon_{ij},$$

where $\delta_1 + \beta_1$ is the between-cluster effect and $\beta_1$ the within-cluster effect (this
parameterization may be preferable because the covariates $\bar{x}_{\cdot j}$ and $x_{ij} - \bar{x}_{\cdot j}$
are uncorrelated). A Wald test of the equality of the between-cluster and
within-cluster regression effects (i.e., a test of the null hypothesis that $\delta_1 = 0$)
is identical to the Hausman specification test for the random intercept model
(e.g. Hausman, 1978); see also Section 8.5.1. Some economists believe that a
significant Hausman test implies that the random intercept model must be
abandoned in favor of a fixed effects model. However, this is misguided since
$\beta_1$ can be estimated without bias as long as the cluster mean $\bar{x}_{\cdot j}$ is included as
covariate in addition to $x_{ij}$, as shown above. See Snijders and Berkhof (2004)
for further discussion where it is also shown that analogous results hold for
models with several random effects described in the next section.
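
The effect of including the cluster mean can be illustrated with simulated data. The sketch below generates data from the correct model with an omitted cluster-level covariate and then compares the coefficient of $x_{ij}$ with and without the cluster mean. For simplicity it uses ordinary least squares rather than estimation of the random intercept model, so it only illustrates the bias argument and is not the estimator discussed in the text. All parameter values and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
J, n = 500, 10                      # hypothetical numbers of clusters and units
beta1, gamma01 = -0.5, 1.0          # within-cluster effect and effect of w_j

m = rng.normal(size=(J, 1))                   # cluster-level component of x_ij
x = m + rng.normal(size=(J, n))               # covariate x_ij
w = m + 0.5 * rng.normal(size=(J, 1))         # omitted covariate, related to the cluster mean
zeta = rng.normal(scale=0.5, size=(J, 1))     # cluster-level disturbance
y = 1.0 + gamma01 * w + zeta + beta1 * x + rng.normal(size=(J, n))

# Cluster means repeated for every unit in the cluster
xbar = np.repeat(x.mean(axis=1, keepdims=True), n, axis=1)


def ols(columns, response):
    """Least-squares coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(response.size)] + [c.ravel() for c in columns])
    coef, *_ = np.linalg.lstsq(X, response.ravel(), rcond=None)
    return coef


b_naive = ols([x], y)[1]                      # cluster mean omitted
coef = ols([xbar, x - xbar], y)               # between and within parameterization
b_between, b_within = coef[1], coef[2]
print(round(b_naive, 3), round(b_between, 3), round(b_within, 3))
```

In this setup the naive coefficient is pulled toward the positive between-cluster relationship, whereas the within-cluster coefficient should be close to the value of `beta1` used to generate the data.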

Note that correlations between the level-1 residual $\epsilon_{ij}$ and covariates remain
a potential problem as in any linear regression model. Unfortunately, this
problem is not as easily discovered and overcome as for the random intercept.


3.2.2 Two-level random coefficient models 

The two-level random intercept model can be extended to a random coefficient
model by allowing the slope of a covariate $x_{ij}$ to vary between clusters, for
instance schools. The extended model can be expressed as

$$y_{ij} = \eta_{0j} + \eta_{1j} x_{ij} + \epsilon_{ij}, \tag{3.9}$$

where $x_{ij}$ is a student-level covariate such as gender and $\eta_{0j}$ and $\eta_{1j}$ are
the intercept and slope for the $j$th school, respectively. The between-school
variability of the intercept is modeled as before and we add a similar model
for the slope,

$$\begin{aligned}
\eta_{0j} &= \gamma_{00} + \gamma_{01} w_j + \zeta_{0j}, \\
\eta_{1j} &= \gamma_{10} + \gamma_{11} w_j + \zeta_{1j}.
\end{aligned} \tag{3.10}$$

The school-level residuals or disturbances $\zeta_{0j}, \zeta_{1j}$ are specified as bivariate
normal with zero means, variances $\psi_{11}$ and $\psi_{22}$ and covariance $\psi_{21}$.

Substituting the school-level models for the coefficients $\eta_{0j}$ and $\eta_{1j}$ in (3.10)
into the student-level model in (3.9), we obtain the reduced form

$$\begin{aligned}
y_{ij} &= \underbrace{\gamma_{00} + \gamma_{01} w_j + \zeta_{0j}}_{\eta_{0j}} + \underbrace{(\gamma_{10} + \gamma_{11} w_j + \zeta_{1j})}_{\eta_{1j}} x_{ij} + \epsilon_{ij} \\
&= \gamma_{00} + \gamma_{01} w_j + \gamma_{10} x_{ij} + \gamma_{11} w_j x_{ij} + \zeta_{0j} + \zeta_{1j} x_{ij} + \epsilon_{ij}.
\end{aligned}$$

In contrast to the random intercept model, the random coefficient model
induces heteroscedastic responses since the conditional variance,

$$\mathrm{Var}(y_{ij} \mid x_{ij}, w_j) = \psi_{11} + 2\psi_{21} x_{ij} + \psi_{22} x_{ij}^2 + \theta, \tag{3.11}$$

depends on $x_{ij}$. It follows that the intraclass correlation in this case also
depends on covariates.
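
The dependence of the variance and the intraclass correlation on the covariate can be made concrete with a small numerical sketch. The function names and parameter values are illustrative assumptions. The intraclass correlation is computed for two units in the same cluster that share the same covariate value $x$.

```python
def conditional_variance(x, psi11, psi21, psi22, theta):
    """Var(y_ij | x_ij, w_j) as in (3.11)."""
    return psi11 + 2.0 * psi21 * x + psi22 * x ** 2 + theta


def intraclass_correlation(x, psi11, psi21, psi22, theta):
    """Correlation between two responses in a cluster, both with covariate value x."""
    between = psi11 + 2.0 * psi21 * x + psi22 * x ** 2   # Var(zeta_0j + zeta_1j * x)
    return between / (between + theta)


params = dict(psi11=1.0, psi21=0.2, psi22=0.5, theta=2.0)  # hypothetical values
for x in (0.0, 1.0, 2.0):
    print(x, conditional_variance(x, **params), intraclass_correlation(x, **params))
```

With these values both quantities increase with $x$, whereas in a random intercept model (`psi21 = psi22 = 0`) they would be constant.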

It is important to note that the random intercept variance and the correlation
between intercept and slope are not invariant to translation of $x_{ij}$. This
can be seen in Figure 3.1 where identical cluster-specific regression lines are
shown in two panels, but with the explanatory variable $x'_{ij} = x_{ij} - 3.5$ in the
right panel translated relative to the explanatory variable $x_{ij}$ in the left panel.
The intercepts are the intersections of the regression lines with the vertical line
at zero. It is clear that these intercepts vary more in the left panel than the
right panel, whereas the correlation between intercepts and slopes is negative
in the left panel and positive in the right panel.
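
The lack of invariance follows from the fact that, after replacing $x_{ij}$ by $x_{ij} - c$, the new intercept is the value of the cluster-specific line at $x_{ij} = c$, that is $\eta_{0j} + c\,\eta_{1j}$. A minimal sketch of the implied change in the covariance parameters is given below. The starting values are hypothetical and were chosen only to mimic the pattern described for Figure 3.1.

```python
import math


def translate(psi11, psi21, psi22, c):
    """Covariance parameters after replacing x_ij by x_ij - c.

    The new random intercept is zeta_0j + c * zeta_1j; the slope is unchanged.
    """
    new11 = psi11 + 2.0 * c * psi21 + c ** 2 * psi22   # variance of the new intercept
    new21 = psi21 + c * psi22                          # covariance with the slope
    return new11, new21, psi22


def correlation(psi11, psi21, psi22):
    """Correlation between random intercept and random slope."""
    return psi21 / math.sqrt(psi11 * psi22)


original = (4.0, -1.0, 0.4)              # hypothetical psi11, psi21, psi22
shifted = translate(*original, c=3.5)    # same shift as in Figure 3.1
print(shifted, correlation(*original), correlation(*shifted))
```

For these values the intercept variance decreases after the shift and the intercept-slope correlation changes sign from negative to positive, while the slope variance is unaffected.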



Figure 3.1 Cluster-specific regression lines for random coefficient model, illustrating
lack of invariance under translation of explanatory variable


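It is convenient to formulate the reduced form of the random coefficient
model in vector notation

$$\begin{aligned}
y_{ij} &= \underbrace{\gamma_{00}}_{\beta_0} + \underbrace{\gamma_{01} w_j}_{\beta_1 x_{1ij}} + \underbrace{\gamma_{10} x_{ij}}_{\beta_2 x_{2ij}} + \underbrace{\gamma_{11} w_j x_{ij}}_{\beta_3 x_{3ij}} + \zeta_{0j} + \zeta_{1j} x_{ij} + \epsilon_{ij} \\
&= (\beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij} + \beta_3 x_{3ij}) + (\zeta_{0j} + \zeta_{1j} z_{1ij}) + \epsilon_{ij} \\
&= \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\boldsymbol{\zeta}_j + \epsilon_{ij},
\end{aligned} \tag{3.12}$$

where the $x_{qij}$ denote covariates with fixed coefficients and $z_{qij}$ covariates with
random coefficients. The covariate vector $\mathbf{x}_{ij}' = (1, x_{1ij}, x_{2ij}, x_{3ij})$ for the fixed
effects $\boldsymbol{\beta}' = (\beta_0, \beta_1, \beta_2, \beta_3)$ includes both student and school-level covariates as
well as products of student and school-level variables representing cross-level
interactions, and $\boldsymbol{\zeta}_j' = (\zeta_{0j}, \zeta_{1j})$.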

Two-level models can either be specified through separate models at levels 1 
and 2 as in (3.9) and (3.10) or directly through their reduced form as in (3.12). 
The former approach is used for instance by Raudenbush and Bryk (2002) 
whereas the latter is used for instance by Goldstein (2003) and Rabe-Hesketh 
et al. (2004a). 

See also page 85 for a discussion of random coefficient models or ‘growth
curve models’ for longitudinal data.


Two-level model in matrix notation 

A two-level model can be written in matrix notation by stacking all responses
into a single vector $\mathbf{y}$. Correspondingly, we stack the row vectors $\mathbf{x}_{ij}'$ into the
matrix $\mathbf{X}$. We will in the sequel adhere to standard terminology and denote
this matrix as a ‘design’ matrix, although we acknowledge that its values are
not necessarily determined from an experimental design (see Kempthorne,
1980). Letting $\boldsymbol{\zeta}_{(D)}$ denote the vector of all random effects, the matrix equation
for the entire sample becomes

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_{(D)}\boldsymbol{\zeta}_{(D)} + \boldsymbol{\epsilon}, \tag{3.13}$$

where the subscript $(D)$ denotes that the matrix or vector contains elements
for the entire dataset (here $\mathbf{y} \equiv \mathbf{y}_{(D)}$, $\mathbf{X} \equiv \mathbf{X}_{(D)}$ and $\boldsymbol{\epsilon} \equiv \boldsymbol{\epsilon}_{(D)}$). Note that
$\mathbf{Z}_{(D)}$ is a block-diagonal matrix with blocks corresponding to level-2 units. To
see this, consider a model with two covariates having both fixed and random
coefficients, where $x_{1ij} = z_{1ij}$ and $x_{2ij} = z_{2ij}$. For two level-2 units, the first
containing 3 level-1 units and the second 4 level-1 units, the matrix equation
is written out in full for the individual level-1 units in Display 3.1 B.i on page
56.
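
The block-diagonal structure of the design matrix for the random effects can be sketched as follows for the case just described, with two level-2 units containing 3 and 4 level-1 units and a random intercept plus two random coefficients. The helper function and the covariate values are made up for illustration and are not taken from Display 3.1.

```python
import numpy as np


def block_diagonal(blocks):
    """Place the given 2-D arrays along the diagonal of a matrix of zeros."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out


rng = np.random.default_rng(2)
sizes = [3, 4]   # numbers of level-1 units in the two level-2 units
# Each block has columns (1, z_1ij, z_2ij); the covariate values are simulated
blocks = [np.column_stack([np.ones(n), rng.normal(size=(n, 2))]) for n in sizes]
Z_D = block_diagonal(blocks)
print(Z_D.shape)
print(np.round(Z_D, 2))
```

Each level-2 unit contributes its own block of rows and its own three columns, so that the random effects of one unit never enter the equations of the other.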

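For a single level-2 unit the model becomes

$$\mathbf{y}_{j(2)} = \mathbf{X}_{j(2)}\boldsymbol{\beta} + \mathbf{Z}_{j(2)}\boldsymbol{\zeta}_j + \boldsymbol{\epsilon}_{j(2)}, \tag{3.14}$$

where $\mathbf{y}_{j(2)}$, $\mathbf{X}_{j(2)}$, $\boldsymbol{\zeta}_j$ and $\boldsymbol{\epsilon}_{j(2)}$ are the rows in the unit-level representation
in Display 3.1 B.i corresponding to the $j$th level-2 unit whereas $\mathbf{Z}_{j(2)}$ is the
pertinent block from the block-diagonal design matrix for the random effects;
see Display 3.1 C.i. Here the subscript $j(2)$ indicates that the vectors and
matrices contain all elements for the $j$th level-2 unit. Note that $\boldsymbol{\zeta}_j \equiv \boldsymbol{\zeta}_{j(2)}$.

Letting $\boldsymbol{\Psi}$ be the covariance matrix of $\boldsymbol{\zeta}_j$, the conditional covariance
structure for the responses of a level-2 unit can be written as

$$\boldsymbol{\Omega}_{j(2)} \equiv \mathrm{Cov}(\mathbf{y}_{j(2)} \mid \mathbf{X}_{j(2)}, \mathbf{Z}_{j(2)}) = \mathbf{Z}_{j(2)}\boldsymbol{\Psi}\mathbf{Z}_{j(2)}' + \theta\mathbf{I}. \tag{3.15}$$

3.2.3 Three-level models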

A two-level model for students nested in schools may be unrealistic since 
students are also nested within classes (which are themselves nested within 
schools). We would hence expect the correlation between responses of two 
students from the same school to be higher if the students also belong to the 
same class. This can be modeled using a three-level model. 

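Extending (3.12), a general three-level model for level-1 units $i$ (e.g.
students), level-2 units $j$ (e.g. classes) and level-3 units $k$ (e.g. schools) can be
written in reduced form as

$$y_{ijk} = \mathbf{x}_{ijk}'\boldsymbol{\beta} + \mathbf{z}_{ijk}^{(2)\prime}\boldsymbol{\zeta}_{jk}^{(2)} + \mathbf{z}_{ijk}^{(3)\prime}\boldsymbol{\zeta}_{k}^{(3)} + \epsilon_{ijk}. \tag{3.16}$$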

The terms represent, respectively, the fixed part of the model, the level-2
random part, the level-3 random part and the level-1 residual. Here, $\mathbf{x}_{ijk}$ is a vector
of explanatory variables (including the constant) with fixed regression
coefficients $\boldsymbol{\beta}$, $\mathbf{z}_{ijk}^{(2)}$ is an $M_2$-dimensional vector of explanatory variables with
random coefficients at level 2 and $\mathbf{z}_{ijk}^{(3)}$ is an $M_3$-dimensional vector of
explanatory variables with random coefficients at level 3. The superscripts
attached to the random effects indicate the level at which they vary whereas
the superscripts attached to the covariates indicate the level at which the
corresponding random coefficients vary. The random effects at each level have
a multivariate normal distribution and random effects at different levels are
mutually independent and independent of the level-1 residual.

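Extending (3.13), the three-level model can be written in matrix notation
for the entire sample as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_{(D)}^{(2)}\boldsymbol{\zeta}_{(D)}^{(2)} + \mathbf{Z}_{(D)}^{(3)}\boldsymbol{\zeta}_{(D)}^{(3)} + \boldsymbol{\epsilon}. \tag{3.17}$$

It is sometimes convenient to use a single design matrix

$$\mathbf{Z}_{(D)} = [\mathbf{Z}_{(D)}^{(2)}, \mathbf{Z}_{(D)}^{(3)}],$$

with a corresponding vector of random effects

$$\boldsymbol{\zeta}_{(D)} = (\boldsymbol{\zeta}_{(D)}^{(2)\prime}, \boldsymbol{\zeta}_{(D)}^{(3)\prime})',$$

where $\boldsymbol{\zeta}_{(D)}$ stands for all random effects for the entire dataset, including both
level-2 and level-3 random effects. The model can then be expressed as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_{(D)}\boldsymbol{\zeta}_{(D)} + \boldsymbol{\epsilon}. \tag{3.18}$$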

This formulation is shown in detail in Display 3.2 B.i on page 59 for a random
intercept model. Here two level-3 units each contain two level-2 units
containing two level-1 units each. We have permuted the columns of $\mathbf{Z}_{(D)}$ to obtain a
block-diagonal form with blocks $\mathbf{Z}_{1(3)}$ for the units in the first level-3 cluster
and $\mathbf{Z}_{2(3)}$ for the units in the second level-3 cluster. The vector of random
effects $\boldsymbol{\zeta}_{k(3)}$ is correspondingly permuted. This allows the model for the $k$th
level-3 unit to be written as

$$\mathbf{y}_{k(3)} = \mathbf{X}_{k(3)}\boldsymbol{\beta} + \mathbf{Z}_{k(3)}\boldsymbol{\zeta}_{k(3)} + \boldsymbol{\epsilon}_{k(3)};$$

see Display 3.2 C.i.

The model now looks algebraically equivalent to a two-level model. This
has the advantage that we can apply any results specifically developed for
two-level models to higher-level models. For example, we can directly apply
equation (3.15) to obtain the conditional covariance matrix for the responses
of the $k$th level-3 unit,

$$\boldsymbol{\Omega}_{k(3)} \equiv \mathrm{Cov}(\mathbf{y}_{k(3)} \mid \mathbf{X}_{k(3)}, \mathbf{Z}_{k(3)}) = \mathbf{Z}_{k(3)}\boldsymbol{\Psi}_{k(3)}\mathbf{Z}_{k(3)}' + \theta\mathbf{I}, \tag{3.19}$$

where $\boldsymbol{\Psi}_{k(3)}$ is the covariance matrix of all random effects $\boldsymbol{\zeta}_{k(3)}$ for the $k$th
level-3 unit.
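
As a small numerical sketch of this covariance structure, consider a three-level random intercept model for one school with two classes of two students each. The variance values below are hypothetical, and the columns of the design matrix are ordered as class 1 intercept, class 2 intercept and school intercept.

```python
import numpy as np

psi2, psi3, theta = 1.0, 0.5, 2.0   # hypothetical class, school and level-1 variances

# Rows: four students; columns: class 1 intercept, class 2 intercept, school intercept
Z = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)
Psi = np.diag([psi2, psi2, psi3])   # random effects at different levels are independent
Omega = Z @ Psi @ Z.T + theta * np.eye(4)

sd = np.sqrt(np.diag(Omega))
Corr = Omega / np.outer(sd, sd)
same_class = Corr[0, 1]             # students in the same class (and hence school)
same_school_only = Corr[0, 2]       # same school but different classes
print(np.round(Omega, 2))
print(round(same_class, 3), round(same_school_only, 3))
```

Students in the same class share both the class and the school intercept, so their responses are more highly correlated than those of students who share only the school.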


3.2.4 Higher-level models 

A general $L$-level model can be written as

$$y = \mathbf{x}'\boldsymbol{\beta} + \sum_{l=2}^{L} \mathbf{z}^{(l)\prime}\boldsymbol{\zeta}^{(l)} + \epsilon, \tag{3.20}$$

where the fixed part is as before, $\mathbf{z}^{(l)}$ is an $M_l$-dimensional vector of explanatory
variables with random coefficients $\boldsymbol{\zeta}^{(l)}$ at level $l$, and we have dropped the
unit and cluster indices to simplify notation. The random effects at a given
level $l$ are usually assumed to have a multivariate normal distribution with
zero mean and covariance matrix $\boldsymbol{\Psi}^{(l)}$. The random effects at different levels
are assumed to be mutually independent and independent of the residual error
term.

3.2.5 Generalized linear mixed models 

The multilevel models discussed so far have been for continuous responses.
However, all the response types discussed in Chapter 2 can be accommodated
by specifying the conditional distribution of the responses given the random
effects as a generalized linear model with a linear predictor $\nu$ of the same form
as the conditional mean in (3.20),

$$\nu = \mathbf{x}'\boldsymbol{\beta} + \sum_{l=2}^{L} \mathbf{z}^{(l)\prime}\boldsymbol{\zeta}^{(l)}.$$

The resulting model is called a generalized linear mixed model. The linear
mixed model is the special case with an identity link and conditionally
normally distributed responses.

In generalized linear mixed models the regression coefficients represent
conditional effects of covariates, given the values of the random effects. These
effects can be interpreted as cluster-specific effects. In contrast, marginal or
population averaged effects are effects of covariates after integrating over the
random effects. The difference between conditional and marginal effects for a
random intercept probit model was shown graphically in Figure 1.6 on page 12.
Note that the present notion of marginal effects is different from that
sometimes used in econometrics where it typically refers to the effect of a small
change in a covariate (given the others) on the expected response in models
without latent variables (e.g. Greene, 2003). With an identity link, the
conditional effects are equal to the marginal effects, but this is not generally
the case. We refer to Section 3.6.5 for further discussion, Section 4.8.1 for
derivations and Section 9.2 for an example.
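
The difference between conditional and marginal effects can be illustrated numerically for a random intercept probit model. The sketch below integrates the conditional probability over a normal random intercept by Gauss-Hermite quadrature and compares the result with the standard closed form in which the linear predictor is divided by $\sqrt{1 + \psi}$. The coefficient values and function names are hypothetical.

```python
import math
import numpy as np


def Phi(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def marginal_prob(linpred, psi, n_nodes=40):
    """Average Phi(linpred + zeta) over zeta ~ N(0, psi) by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    values = [Phi(linpred + math.sqrt(2.0 * psi) * a) for a in nodes]
    return float(np.dot(weights, values) / math.sqrt(math.pi))


beta0, beta1, psi = -0.5, 1.0, 1.0   # hypothetical conditional coefficients and variance
for x in (0.0, 1.0, 2.0):
    conditional = Phi(beta0 + beta1 * x)               # cluster with zeta = 0
    marginal = marginal_prob(beta0 + beta1 * x, psi)   # averaged over clusters
    closed_form = Phi((beta0 + beta1 * x) / math.sqrt(1.0 + psi))
    print(x, round(conditional, 4), round(marginal, 4), round(closed_form, 4))
```

The marginal probabilities are attenuated toward 0.5 relative to the conditional ones, which corresponds to marginal coefficients that are smaller in absolute value than the conditional coefficients.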

In generalized linear models the variance of the responses given the random
effects and covariates, the ‘level-1 variance’, is determined by the variance
function of the specified conditional distribution. If the responses are counts,
modeled as conditionally Poisson or binomial, overdispersion at level 1 may be
modeled by including a random intercept at level 1 (e.g. Breslow and Clayton,
1993). Goldstein (1987), Schall (1991) and others adopt a quasi-likelihood
approach by including an extra dispersion parameter in the level-1 variance
function (see Section 2.3.1).

Since the level-1 variance is generally not constant, the correlation between
observed responses in the same cluster is also not constant even in a simple
random intercept model (e.g. Goldstein et al., 2002). For dichotomous and
ordinal responses, the intraclass correlation is therefore often expressed in terms
of the correlation between the latent responses $y_{ij}^*$, which is constant. For a
random intercept probit model, this type of intraclass correlation becomes

$$\rho \equiv \mathrm{Cor}(y_{ij}^*, y_{i'j}^* \mid \mathbf{x}_{ij}, \mathbf{x}_{i'j}) = \frac{\psi}{\psi + 1},$$

known as the tetrachoric correlation in the dichotomous case without
covariates. For a logit model, the ‘1’ in the denominator is replaced by $\pi^2/3$, the
variance of the logistic level-1 error.
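
The two versions of the latent-response intraclass correlation can be computed with a short helper. The function name and the example variance are illustrative.

```python
import math


def latent_icc(psi, link="probit"):
    """Intraclass correlation of the latent responses in a random intercept model."""
    # Level-1 variance of the latent response: 1 for probit, pi^2 / 3 for logit
    level1_variance = {"probit": 1.0, "logit": math.pi ** 2 / 3.0}[link]
    return psi / (psi + level1_variance)


print(latent_icc(1.0, "probit"), latent_icc(1.0, "logit"))
```

For the same random intercept variance, the logit version is smaller because the logistic level-1 error has the larger variance.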


3.2.6 Models with nonhierarchical random effects

Models with crossed random effects

So far, we have discussed hierarchical models where units are classified by
some factor (for instance school) into top-level clusters at level $L$. The units in
each top-level cluster are then (sub)classified by a further factor (for instance
class) into clusters at level $L-1$, etc. The factors defining the classifications are
nested in the sense that a lower-level cluster can only belong to one higher-level
cluster (for instance a class can only belong to one school).

We now discuss non-hierarchical models where units are cross-classified by
two or more factors, with each unit potentially belonging to any combination
of ‘levels’ of the different factors. A prominent example is panel data where the
factor ‘individual’ (or country, firm, etc.) is crossed with another factor ‘time’
or occasion. While unit-specific unobserved heterogeneity is often accommodated
using random effects (see Section 3.6.1), random effects modeling of
occasion-specific unobserved heterogeneity, due to shared experience of events
at each occasion such as strikes, new legislation or weather conditions,
appears to be confined to econometrics. If both factors are treated as random,
econometricians call the model a two-way error component model (e.g.
Baltagi, 2001, Chapter 3). Models with cross-classified random effects also arise in
‘generalizability theory’. Here a simple design is a two-way cross-classification
of subjects to be rated and raters; see also Section 3.3.1.

Consider students who are cross-classified by elementary school and high
school. Ignoring elementary school, a two-level random intercept model for
students $i$ nested in high schools $j$ would be specified as

$$\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j^{(2)}. \tag{3.21}$$

The corresponding hierarchical structure of students nested in high schools is
shown in Figure 3.2 where each student is represented by a short vertical line
to the right of the long vertical line representing his or her high school.

Including an additional random intercept $\zeta_p^{(3)}$ for the students’ elementary
school $p$ destroys the nesting since students from the same high school do not
necessarily come from the same elementary school and vice versa. This can be
seen in the figure where the lines connecting students with their elementary
schools are crossed. Note that students cannot be reshuffled to untangle the
crossings. The model with crossed random effects can be written as

$$\nu_{ijp} = \mathbf{x}_{ijp}'\boldsymbol{\beta} + \zeta_j^{(2)} + \zeta_p^{(3)},$$

where the $p$ subscript has been added in the fixed part to accommodate
elementary school-specific covariates.

Goldstein (1987) described a trick for expressing a model with crossed random 
effects as a hierarchical model with a larger number of random effects. 
This is important because many estimation methods are confined to models 
with nested random effects. In the present example, we must first introduce 
a ‘virtual’ level within which both elementary schools and high schools are 
nested, such as towns, and call this level 3. Note that this level does not need 
to be ‘natural’ in any sense; for instance, if one child moved to another town to 
attend high school, these two towns could be merged into a single level-3 unit. 
If it is not possible to find a virtual third level, this level could be defined as 
a single unit encompassing the whole dataset. In Figure 3.2 the virtual level 
is shown by vertical lines spanning groups of high schools and elementary 
schools. Students within a virtual level-3 unit can only belong to high schools 
and elementary schools within that unit (no crossing between level-3 units). 

Now label the elementary schools arbitrarily within each level-3 unit (e.g. 
town) as $p' = 1,\ldots,n_{\max}$, where $n_{\max}$ is the maximum number of elementary 
schools within a level-3 unit. Using the $k$ subscript for level 3, we can write 
the model above equivalently as 

$$\nu_{ijp} = \mathbf{x}_{ijp}'\boldsymbol{\beta} + \zeta_j^{(2)} + \sum_{p'=1}^{n_{\max}} \zeta_{p'k}^{(3)} d_{ip'},$$ 

where $\zeta_{p'k}^{(3)}$ is the $p'$th random ‘slope’ at level 3 and $d_{ip'}$ equals 1 if student 
$i$ went to any of the elementary schools labeled $p'$ and zero otherwise. Here 
the random intercepts for all elementary schools numbered $p'$ are represented 
by (different realizations of) a random slope (varying at the third level) of a 
dummy variable for elementary schools $p'$. The variances of the random slopes 
are constrained to be equal (to achieve a constant random intercept variance in 
the original model) and their covariances set to zero (since elementary schools 
are mutually independent). The covariances between the random slopes and 
the random intercept for high school are also zero. 
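
The relabeling of elementary schools within level-3 units and the resulting dummy variables can be illustrated with a small sketch. It assumes NumPy is available; the function name `elementary_dummies` and the toy town and school identifiers are hypothetical and serve only to show how the labels $p'$ and the indicators $d_{ip'}$ could be constructed.

```python
import numpy as np

def elementary_dummies(town, elem_school):
    """Return (labels, D): labels[i] is the arbitrary within-town label p'
    (1-based) of student i's elementary school, and D[i, p'-1] = d_{ip'}."""
    town = np.asarray(town)
    elem_school = np.asarray(elem_school)
    labels = np.zeros(len(town), dtype=int)
    for k in np.unique(town):
        in_town = town == k
        # Number the distinct elementary schools 0, 1, ... within town k.
        _, inverse = np.unique(elem_school[in_town], return_inverse=True)
        labels[in_town] = inverse + 1
    n_max = labels.max()  # maximum number of elementary schools per town
    D = (labels[:, None] == np.arange(1, n_max + 1)[None, :]).astype(int)
    return labels, D

# Hypothetical data: five students in two towns (the virtual level 3).
town = ["A", "A", "A", "B", "B"]
elem = ["e1", "e2", "e1", "e7", "e9"]
labels, D = elementary_dummies(town, elem)
print(labels)
print(D)
```

Each row of `D` contains a single 1, so each student picks up exactly one of the level-3 random slopes; schools in different towns that share a label share a column but receive different realizations of the slope.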

In the figure $n_{\max} = 3$ and the elementary schools within a virtual level-3 
unit are labeled from 1 to at most 3. The three long vertical lines for virtual 
level 3 represent values of the random effects for elementary schools labeled 
1, 2 and 3. The dots indicate for which students the corresponding dummy 
variables $d_{i1}$ to $d_{i3}$ are equal to 1. For instance, the first three dots from the 
top signify that the first three students from the top belong to elementary 
schools 1, 2 and 3, respectively. The reason a single random effect can be used 
for several primary schools (all primary schools with the same label) is that 
the random effect takes on a different value for each primary school. 

Unfortunately, formulating models with crossed effects as multilevel models 
becomes infeasible when $n_{\max}$ becomes too large. In this case Markov chain 
Monte Carlo methods such as the AIP algorithm discussed in Section 6.11.5 
may be used. 

See Snijders and Bosker (1999, Chapter 11), Raudenbush and Bryk (2002, 
Chapter 12) and Goldstein (2003, Appendix 11.1) for further discussion of 
models with crossed random effects. 

Multiple-membership models 

Ignoring elementary schools, we now return to the model in (3.21) with a 
random intercept for high schools. If a student $i$ has attended several high 
schools $\{j\}$, spending time $t_{ih}$ in school $h \in \{j\}$, a reasonable 
‘multiple-membership’ model might be 

$$\nu_{i\{j\}} = \mathbf{x}_i'\boldsymbol{\beta} + \sum_{h \in \{j\}} \zeta_h^{(2)} t_{ih},$$ 

where $\zeta_h^{(2)}$ represents the effect, per time unit ‘exposed’, of attending school 
$h$. (Sometimes $t_{ih}$ is replaced by the proportion of time spent in school $h$, by 
dividing by the total time $\sum_h t_{ih}$.) 

For simplicity we have assumed that there are no school-specific fixed effects. 
There are two problems making this model nonnested. First, for students 
attending (at least) two schools, the first and second schools attended are 
crossed, i.e. students attending the same first school could end up in different 
second schools and vice versa. Second, the random effect of a particular school 
must take on the same value, whether it is a given student’s first, second or 
third school. This aspect is different from crossed-effects models. 

Figure 3.2 Model structure and data structure for students in high schools crossed 
with elementary schools 

Hill and Goldstein (1998) proposed a trick for representing these models 
as hierarchical models; see also Rasbash and Browne (2001). For the present 
example, we first need to find a third level so that students only cross between 
schools within a level-3 unit but not between units (e.g. towns). Labeling the schools 
within each level-3 unit arbitrarily as $h = 1,\ldots,n_{\max}$, where $n_{\max}$ is the 
maximum number of schools per level-3 unit, the model can be written as 

$$\nu_{i\{j\}} = \mathbf{x}_i'\boldsymbol{\beta} + \sum_{h=1}^{n_{\max}} \zeta_{hk}^{(3)} t_{ih} d_{ih},$$ 

where $\zeta_{hk}^{(3)}$ is the $h$th random slope at level 3 and $d_{ih}$ is a dummy variable equal 
to 1 if student $i$ ever attended any of the schools labeled $h$ and 0 otherwise. 
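
The products $t_{ih} d_{ih}$ that multiply the level-3 random slopes can be collected in a weight matrix. The following is a minimal sketch assuming NumPy; the function `membership_weights`, the `spells` record layout (student index, school label, time) and the toy numbers are hypothetical, and every student is assumed to have at least one spell.

```python
import numpy as np

def membership_weights(n_students, n_max, spells, proportions=False):
    """Build W with W[i, h-1] = t_ih * d_ih from (student, label h, time) records.

    With proportions=True each row is divided by the student's total time,
    so that the weights are proportions of time spent in each school."""
    W = np.zeros((n_students, n_max))
    for i, h, t in spells:
        W[i, h - 1] += t  # time student i spent in the school labeled h
    if proportions:
        W = W / W.sum(axis=1, keepdims=True)
    return W

# Hypothetical spells: student 0 attended schools 1 and 3, student 1 school 2.
spells = [(0, 1, 2.0), (0, 3, 1.0), (1, 2, 3.0)]
W = membership_weights(2, 3, spells)
P = membership_weights(2, 3, spells, proportions=True)
print(W)
print(P)
```

Columns of `W` for schools a student never attended stay at zero, which is the role played by the dummy variable $d_{ih}$ in the formula.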


3.3 Factor models and item response models 

3.3.1 Platonic true scores: measurement models 

Platonic true scores exist, can in principle be measured and would in this 
case represent a gold standard. However, error prone measures are often used 
instead of the gold standard for practical reasons. For instance, Roeder et 
al. (1996) considered an application where ‘low-density lipoprotein cholesterol’ 
was measured using the less costly and time-consuming measure ‘total 
cholesterol’. 

A simple measurement model, standard in ‘classical test theory’ (see e.g. 
Lord and Novick, 1968), can be written as 

$$y_{ij} = \eta_j + \epsilon_{ij},$$    (3.22) 

where $y_{ij}$ is the $i$th measurement on the $j$th unit, $\eta_j$ is the true score for unit 
$j$ with variance $\psi$ and $\epsilon_{ij}$ are measurement errors with variance $\theta$. 

It is usually assumed that the $\epsilon_{ij}$ are mutually independent so that the $y_{ij}$ 
are conditionally independent given $\eta_j$. It is furthermore assumed that the 
$\epsilon_{ij}$ have zero expectation and are independent of the true score. However, if we 
have a validation sample for which the gold-standard is available in addition 
to the fallible measures, we can assess these assumptions. In particular, we 
say that a measurement is biased if the expectation of the measurement error 
is not zero. 

The reliability $\rho$ can be defined as the proportion of the total variance of 
the measurements that is due to the true score variance, 

$$\rho = \frac{\mathrm{Var}(\eta_j)}{\mathrm{Var}(\eta_j) + \mathrm{Var}(\epsilon_{ij})} = \frac{\psi}{\psi + \theta}.$$ 

Note that this is just the intraclass correlation for a random intercept or one-way 
random effects model discussed in Section 3.2.1. 
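
As a rough illustration of how the reliability could be estimated from replicate measurements, the sketch below uses one-way ANOVA moment estimators for $\psi$ and $\theta$ in balanced data. This estimator is a choice made for the example, not a method prescribed here; the function name `anova_reliability` and the simulated values are hypothetical.

```python
import numpy as np

def anova_reliability(y):
    """Moment estimates of psi, theta and rho = psi / (psi + theta) from a
    balanced array y of shape (n_units, n_replicates)."""
    y = np.asarray(y, dtype=float)
    n_units, n_reps = y.shape
    unit_means = y.mean(axis=1)
    grand_mean = y.mean()
    # Between-unit mean square has expectation theta + n_reps * psi.
    ms_between = n_reps * np.sum((unit_means - grand_mean) ** 2) / (n_units - 1)
    # Within-unit mean square has expectation theta.
    ms_within = np.sum((y - unit_means[:, None]) ** 2) / (n_units * (n_reps - 1))
    theta_hat = ms_within
    psi_hat = (ms_between - ms_within) / n_reps
    return psi_hat, theta_hat, psi_hat / (psi_hat + theta_hat)

# Simulate from the simple measurement model with psi = 4 and theta = 1,
# so that the true reliability is 4 / (4 + 1) = 0.8.
rng = np.random.default_rng(0)
psi, theta = 4.0, 1.0
eta = rng.normal(0.0, np.sqrt(psi), size=2000)
eps = rng.normal(0.0, np.sqrt(theta), size=(2000, 3))
y = eta[:, None] + eps
psi_hat, theta_hat, rho_hat = anova_reliability(y)
print(round(rho_hat, 3))
```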

The simple one-way random effects model is appropriate if the measurements 
on each person can be considered exchangeable replicates, but this may 
be unrealistic. For instance, if exams are marked by a panel of examiners or 
raters $i$, a two-way model 

$$y_{ij} = \beta_i + \eta_j + \epsilon_{ij}$$ 

may be more appropriate, where $\beta_i$ represents the response bias for rater $i$. 

If the raters are considered a random sample of possible raters, $\beta_i$ can 
be treated as random, giving a two-way random effects model. Note that the 
random effects for subjects and raters are not nested but crossed if each person 
is assessed by each rater (see Section 3.2.6). 

If the raters are considered fixed, $\beta_i$ are fixed effects and the model is a two-way 
mixed effects model. Subject-rater interactions can be included in both 
types of model if each rater provides replicate measurements for each subject. 
These models and the different types of reliability coefficients that can be 
derived from them are discussed in Shrout and Fleiss (1979) and McGraw 
and Wong (1996). 

Treating $\beta_i$ as fixed, we can allow the measurement scales and reliabilities 
to differ between methods (e.g. raters). The congeneric measurement model 
(Jöreskog, 1971b) is specified as 

$$y_{ij} = \beta_i + \lambda_i \eta_j + \epsilon_{ij},$$    (3.23) 

where $\beta_i$ are fixed parameters, $E(\eta_j) = 0$, $\psi \equiv \mathrm{Var}(\eta_j)$ and $\theta_{ii} \equiv \mathrm{Var}(\epsilon_{ij})$. In 
this model $\beta_i$ represents rater bias as before whereas $\lambda_i$ is a rater-specific scale 
parameter. We can interpret $\lambda_i \eta_j$ as the true score measured in the units of 
rater $i$. 

The scale of the true score $\eta_j$ is typically fixed to that of $y_{1j}$ by setting $\lambda_1 = 1$, 
an identification restriction of the kind to be discussed in Section 5.2.3. The 
measurement error variances $\theta_{ii}$ may differ between raters. The reliability for 
rater $i$ becomes 

$$\rho_i = \frac{\lambda_i^2 \psi}{\lambda_i^2 \psi + \theta_{ii}}.$$ 

The congeneric measurement model implies the following covariance structure 
for the vector of measurements $\mathbf{y}_j$ for unit $j$, 

$$\boldsymbol{\Omega} \equiv \mathrm{Cov}(\mathbf{y}_j) = \begin{bmatrix} \lambda_1^2\psi + \theta_{11} & & & \\ \lambda_2\psi\lambda_1 & \lambda_2^2\psi + \theta_{22} & & \\ \vdots & \vdots & \ddots & \\ \lambda_I\psi\lambda_1 & \lambda_I\psi\lambda_2 & \cdots & \lambda_I^2\psi + \theta_{II} \end{bmatrix}.$$ 

Three common special cases of the congeneric measurement model are 

• the essentially tau-equivalent measurement model where $\lambda_i = \lambda$ 

• the tau-equivalent measurement model where $\lambda_i = \lambda$ and $\beta_i = \beta$ 

• the parallel measurement model where $\lambda_i = \lambda$, $\beta_i = \beta$ and $\theta_{ii} = \theta$, giving 
the simple measurement model in (3.22). 
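
The covariance structure and rater-specific reliabilities implied by the congeneric model can be computed directly from $\lambda_i$, $\psi$ and $\theta_{ii}$. The sketch below assumes NumPy; the function names and the parameter values are hypothetical and chosen only so that the results are easy to check by hand.

```python
import numpy as np

def congeneric_cov(lam, psi, theta):
    """Implied covariance matrix: psi * lam lam' + diag(theta)."""
    lam = np.asarray(lam, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return psi * np.outer(lam, lam) + np.diag(theta)

def reliabilities(lam, psi, theta):
    """Reliability lam_i^2 psi / (lam_i^2 psi + theta_ii) for each rater i."""
    lam = np.asarray(lam, dtype=float)
    theta = np.asarray(theta, dtype=float)
    true_var = lam ** 2 * psi
    return true_var / (true_var + theta)

# Hypothetical congeneric parameters with lambda_1 = 1 fixing the scale.
lam, psi, theta = [1.0, 0.5, 2.0], 4.0, [1.0, 1.0, 2.0]
Omega = congeneric_cov(lam, psi, theta)
rho = reliabilities(lam, psi, theta)

# Parallel special case: equal loadings and equal error variances.
Omega_parallel = congeneric_cov([1.0, 1.0, 1.0], 4.0, [1.0, 1.0, 1.0])
print(Omega)
print(rho)
```

In the parallel case all variances are equal and all covariances are equal, which is the compound symmetry structure of the simple measurement model.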

The interpretation of $\epsilon_{ij}$ as measurement error presupposes that the raters 
are measuring the same thing. However, if the raters are influenced by different 
idiosyncrasies, such as handwriting or spelling, the expectation for rater 
$i$ could be expressed as $\beta_i + \lambda_i \eta_j + s_{ij}$, where $s_{ij}$ is the subject-specific bias 
(e.g. due to handwriting or spelling) or, in other words, a rater by subject 
interaction (e.g. Dunn, 1992). As mentioned above, the variances of the specific 
factors cannot be separated from the measurement error variances unless 
replicate measurements are available for each rater and subject. 

It is crucial to distinguish the Berkson measurement model (Berkson, 1950) 
from the classical measurement models discussed above. In the Berkson model 
it is assumed that the true scores $\eta_{ij}$ are normally distributed around the 
measured score $y_j$, 

$$\eta_{ij} = y_j + \epsilon_{Bij},$$ 

where $\epsilon_{Bij}$ is the Berkson error. A situation where such a model may be 
appropriate is when $y_j$ is a controlled variable. For example, an experimenter 
may aim to administer a given dose of a drug but the actual true dose given 
on occasion $i$, $\eta_{ij}$, differs from $y_j$ due to measurement error. The Berkson 
measurement model therefore assumes the measured response to be independent 
of the measurement error, whereas the classical measurement model does not. 
This has important implications for regression with covariate measurement 
errors as discussed in Section 3.5. 

3.3.2 Classical true scores: common factor models 

Unlike platonic true scores, classical true scores or hypothetical constructs 
cannot be measured directly even in principle, intelligence being a typical 
example. The construct is instead measured by different indicators or items such 
as problems in an intelligence test. In contrast to a measurement model, the 
individual items cannot be said to measure intelligence per se in the sense that 
the expectation could be interpreted as ‘true’ intelligence. Instead, different 
aspects of intelligence are measured by different items; for instance items will 
often require different ‘blends’ of verbal, quantitative, abstract and visual 
reasoning. The answer to a particular item is therefore a reflection of both general 
intelligence and an item-specific aspect, referred to as the common and specific 
factor, respectively. We refer to Section 1.3 for an extensive discussion of 
hypothetical constructs. 

A unidimensional common factor model for items $i = 1,\ldots,I$ can be written 
as 

$$y_{ij} = \beta_i + \lambda_i \eta_j + \epsilon_{ij}.$$    (3.24) 

Here, $\eta_j$ is the common factor or latent trait for subject $j$, $\lambda_i$ is a factor 
loading for the $i$th item and $\epsilon_{ij}$ are unique factors. We define $\psi \equiv \mathrm{Var}(\eta_j)$ and 
$\theta_{ii} \equiv \mathrm{Var}(\epsilon_{ij})$ and let $\eta_j$ be independent of $\epsilon_{ij}$. 

The scale of the common factor is either fixed by ‘anchoring’ (typically 
fixing the first factor loading, $\lambda_1 = 1$) or ‘factor standardization’ (fixing the 
factor variance to a positive constant, $\psi = 1$). Although the models resulting 
from either identification restriction are equivalent (see Section 5.3), anchoring 
is beneficial from the point of view of ‘factorial invariance’ (see e.g. Meredith, 
1964, 1993; Skrondal, 1999). For instance, assume that model (3.24) holds 
for a population but we consider the subpopulation of units with negative 
factor values. In this case the original factor loadings are recovered in the 
subpopulation under anchoring (with a reduced variance estimate $\hat{\psi}$) but not 
under factor standardization. 

Note that the intercept $\beta_i$ cannot be interpreted as measurement bias in 
the present context. The intercept can be omitted if the item-specific mean 
$\bar{y}_{i\cdot}$ has been subtracted from $y_{ij}$. Also note that the unidimensional common 
factor model and the congeneric measurement model presented in (3.23) are 
mathematically identical. 

The unique factor can be further decomposed as 

$$\epsilon_{ij} = s_{ij} + e_{ij},$$ 

the sum of a specific factor and measurement error taken to be mutually 
independent and independent of $\eta_j$. Note that the specific factor has a similar 
interpretation to the rater by subject interaction in a measurement model. 
The specific factor is generally considered to be part of the true score for an 
item in which case the reliability becomes 

$$\rho_i = \frac{\lambda_i^2\psi + \mathrm{Var}(s_{ij})}{\lambda_i^2\psi + \mathrm{Var}(s_{ij}) + \mathrm{Var}(e_{ij})}.$$ 

Unfortunately, in most designs the variances of the specific factors and 
measurement errors are not separately identified because there are no replicates 
for the individual items. Replication in terms of longitudinal or 
multimethod-multitrait designs is sometimes used in an attempt to decompose the unique 
factor into specific factors and measurement errors (e.g. Alwin, 1989 and the 
references therein). In the absence of replicates, the reliabilities for factor 
models are often somewhat carelessly expressed as 

$$\rho_i = \frac{\lambda_i^2\psi}{\lambda_i^2\psi + \mathrm{Var}(s_{ij}) + \mathrm{Var}(e_{ij})},$$ 

which then represent lower bounds of the true reliabilities. 

Closely related to this reliability is Cronbach’s $\alpha$ which can be interpreted 
as the maximum likelihood estimator of the reliability of an unweighted 
sumscore, estimated without replicates, by assuming that the items are parallel 
measurements (e.g. Novick and Lewis, 1967). See Greene and Carmines (1980) 
for a discussion of $\alpha$ and other reliability measures for sumscores. 
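
To make the link between Cronbach’s $\alpha$ and the reliability of a sumscore concrete, the sketch below evaluates the usual formula for $\alpha$ on a covariance matrix that satisfies the parallel model exactly. It assumes NumPy; the function name `cronbach_alpha` and the values $\psi = 1$, $\theta = 1$ and $I = 4$ are hypothetical.

```python
import numpy as np

def cronbach_alpha(cov):
    """Cronbach's alpha from an item covariance matrix:
    I / (I - 1) * (1 - sum of item variances / variance of the sumscore)."""
    cov = np.asarray(cov, dtype=float)
    n_items = cov.shape[0]
    # cov.sum() is the variance of the unweighted sum of the items.
    return n_items / (n_items - 1) * (1.0 - np.trace(cov) / cov.sum())

# Covariance matrix implied by four parallel items with psi = 1, theta = 1.
n_items, psi, theta = 4, 1.0, 1.0
cov = psi * np.ones((n_items, n_items)) + theta * np.eye(n_items)
alpha = cronbach_alpha(cov)

# Reliability of the sumscore under the parallel model: the true-score part
# has variance I^2 psi and the error part has variance I theta.
sumscore_reliability = n_items ** 2 * psi / (n_items ** 2 * psi + n_items * theta)
print(alpha, sumscore_reliability)
```

Under these assumptions both quantities equal 0.8; with sample data the covariance matrix would be replaced by its estimate.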

The factor model can also be written in matrix notation as 


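$$\mathbf{y}_j = \boldsymbol{\beta} + \boldsymbol{\Lambda}\eta_j + \boldsymbol{\epsilon}_j,$$ 

where $\boldsymbol{\beta}$ is an $I \times 1$ vector of intercepts, $\boldsymbol{\Lambda}$ is an $I \times 1$ vector of factor loadings, 
$\boldsymbol{\epsilon}_j$ an $I \times 1$ vector of unique factors and $I$ is the total number of items. 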

The covariance structure, in this case called a factor structure, becomes 

$$\boldsymbol{\Omega} \equiv \mathrm{Cov}(\mathbf{y}_j) = \boldsymbol{\Lambda}\psi\boldsymbol{\Lambda}' + \boldsymbol{\Theta} = \begin{bmatrix} \lambda_1^2\psi + \theta_{11} & & & \\ \lambda_2\psi\lambda_1 & \lambda_2^2\psi + \theta_{22} & & \\ \vdots & \vdots & \ddots & \\ \lambda_I\psi\lambda_1 & \lambda_I\psi\lambda_2 & \cdots & \lambda_I^2\psi + \theta_{II} \end{bmatrix},$$ 

where $\boldsymbol{\Theta}$ is a diagonal matrix with the $\theta_{ii}$ placed on the diagonal. Note that 
the covariance structures for the unidimensional common factor model and 
the congeneric measurement model are identical. To fix the scale of the factor 
we typically either fix a factor loading to one or the factor variance to one 
(see Section 5.2.3). 

The models discussed in this section have all been reflective with the items 
interpreted as reflecting or being ‘caused’ by a latent variable. However, it 
sometimes makes more sense to construe latent variables as formative, being 
‘caused’ by the items. A standard example is measurement of socio-economic 
status (SES) for a family, based on the education and income of adult family 
members. In this case a reflective model is dubious; we would expect education 
and income to affect SES and not the other way around. Using factor models 
in the formative case would entail a misspecification. We refer to Edwards and 
Bagozzi (2000) for an overview of different types of relations between items 
and constructs. 


3.3.3 Multidimensional factor models 

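The unidimensional factor model imposes a rather restrictive structure on 
the covariances. In structuring $I(I+1)/2$ variances and covariances only $2I$ 
parameters are used. Hence, less restrictive multidimensional factor models 
are often useful. An $M$-dimensional factor model can be formulated as 

$$\begin{aligned} y_{1j} &= \beta_1 + \lambda_{11}\eta_{1j} + \cdots + \lambda_{1M}\eta_{Mj} + \epsilon_{1j} \\ &\;\;\vdots \\ y_{Ij} &= \beta_I + \lambda_{I1}\eta_{1j} + \cdots + \lambda_{IM}\eta_{Mj} + \epsilon_{Ij}. \end{aligned}$$    (3.25) 

Such a model can alternatively be expressed in matrix form as 

$$\mathbf{y}_j = \boldsymbol{\beta} + \boldsymbol{\Lambda}_y\boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j,$$ 

where $\boldsymbol{\beta}$ is a vector of constants (usually omitted if $\mathbf{y}_j$ is mean-centered), $\boldsymbol{\Lambda}_y$ 
is a factor loading matrix, $\boldsymbol{\eta}_j$ a vector of $M$ common factors with covariance 
matrix $\boldsymbol{\Psi}$ and $\boldsymbol{\epsilon}_j$ a vector of unique factors with diagonal covariance matrix 
$\boldsymbol{\Theta}$. The covariance matrix of the responses becomes 

$$\boldsymbol{\Omega} = \boldsymbol{\Lambda}_y\boldsymbol{\Psi}\boldsymbol{\Lambda}_y' + \boldsymbol{\Theta}.$$    (3.26) 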

Confirmatory factor analysis 

If prior information is available, in terms of substantive theory, previous results 
or employed research design, confirmatory factor analysis (CFA) should be 
used where particular parameters are set to prescribed values, typically zero. 
For instance, $\boldsymbol{\Lambda}_y$ is often specified as an independent clusters structure (e.g. 
Jöreskog, 1969; McDonald, 1985) where each item loads on one and only one 
common factor. 

For example, Mulaik (1988b) considered 9 subjective rating-scale variables 
designed to measure two ‘dimensions’ or factors in connection with a soldier’s 
conception of firing a rifle in combat. The first factor, supposed to be fear, 
had as indicators the four scales ‘frightening’, ‘nerve-shaking’, ‘terrifying’, and 
‘upsetting’. The second factor, optimism about outcome, had as indicators 
the five scales ‘useful’, ‘hopeful’, ‘controllable’, ‘successful’, and ‘bearable’. 
The loadings of variables on irrelevant factors were hypothesized to be zero, 
whereas the factors were expected to be (negatively) correlated. 

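An independent clusters two-factor model where each factor is measured by 
three nonoverlapping items can be written as 

$$\begin{bmatrix} y_{1j} \\ y_{2j} \\ y_{3j} \\ y_{4j} \\ y_{5j} \\ y_{6j} \end{bmatrix} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \beta_5 \\ \beta_6 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ \lambda_{21} & 0 \\ \lambda_{31} & 0 \\ 0 & 1 \\ 0 & \lambda_{52} \\ 0 & \lambda_{62} \end{bmatrix} \begin{bmatrix} \eta_{1j} \\ \eta_{2j} \end{bmatrix} + \begin{bmatrix} \epsilon_{1j} \\ \epsilon_{2j} \\ \epsilon_{3j} \\ \epsilon_{4j} \\ \epsilon_{5j} \\ \epsilon_{6j} \end{bmatrix},$$ 
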
where we have fixed the scale of each factor by setting one factor loading to 
1. A path diagram of this model is given in Figure 3.3. Here circles represent 
latent variables, rectangles represent observed variables, arrows connecting 
circles and/or rectangles represent linear relations and short arrows pointing at 
circles or rectangles represent residual variability. Curved double-headed arrows 
connecting two variables (here the factors) indicate that the variables are 
correlated. 

Factors are sometimes also specified as uncorrelated by setting pertinent 
off-diagonal elements of $\boldsymbol{\Psi}$ to zero. Confirmatory factor analysis is thus a 
hypotheticist procedure designed to test hypotheses about the relationship 
between items and factors, whose number and interpretation are determined 
in advance. 


Exploratory factor analysis 

A fundamentally different approach is exploratory factor analysis (EFA). 
Following Mulaik (1988a), exploratory factor analysis can be construed as an 
inductivist method designed to discover an optimal set of factors, their number 
to be determined in the analysis, that accounts for the covariation among 
the items (see also Holzinger and Harman, 1941). Each factor is then interpreted 
and ‘named’ according to the subset of items having high loadings on 







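loading matrix $\boldsymbol{\Lambda}_y$ by $\mathbf{R}^{-1}$, 

$$\boldsymbol{\Omega} = (\boldsymbol{\Lambda}_y\mathbf{R}^{-1})\,\mathbf{R}\boldsymbol{\Psi}\mathbf{R}'\,\big((\mathbf{R}^{-1})'\boldsymbol{\Lambda}_y'\big) + \boldsymbol{\Theta}.$$ 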

If R is orthogonal, the transformation is a rotation or reflection. 
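
The invariance of the covariance structure under such a transformation can be checked numerically. The sketch below assumes NumPy; the loading matrix, factor covariance matrix, unique variances and the nonsingular matrix `R` are hypothetical values picked for illustration.

```python
import numpy as np

# Hypothetical two-factor model for five items.
Lam = np.array([[1.0, 0.0],
                [0.8, 0.0],
                [0.6, 0.3],
                [0.0, 1.0],
                [0.0, 0.7]])
Psi = np.array([[1.0, 0.3],
                [0.3, 1.0]])
Theta = np.diag([0.5, 0.4, 0.6, 0.3, 0.5])

# Any nonsingular R gives transformed loadings and factor covariances.
R = np.array([[2.0, 1.0],
              [0.0, 1.0]])
Lam_star = Lam @ np.linalg.inv(R)  # loadings multiplied by R^{-1}
Psi_star = R @ Psi @ R.T           # covariance matrix of the factors R eta

Omega = Lam @ Psi @ Lam.T + Theta
Omega_star = Lam_star @ Psi_star @ Lam_star.T + Theta
print(np.allclose(Omega, Omega_star))
```

The two parameter sets imply the same covariance matrix, which is why restrictions are needed to pin down a particular solution.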

In confirmatory factor analysis, restrictions on the factor loadings serve to 
fix the factor rotation, and combined with constraints for the factor scales (either 
by fixing factor variances or by fixing one factor loading for each factor) 
will often suffice to identify the model. In exploratory factor analysis, a standard 
but arbitrary way of identifying the model is to set the factor covariance 
matrix equal to the identity matrix, $\boldsymbol{\Psi} = \mathbf{I}$, and fix the rotation for instance 
by requiring that $\boldsymbol{\Lambda}_y'\boldsymbol{\Theta}^{-1}\boldsymbol{\Lambda}_y$ is diagonal (e.g. Lawley and Maxwell, 1971). 

Exploratory factor analysis is often confused with principal component analysis 
and we therefore give a brief description of the latter. Principal components 
are linear combinations $\mathbf{a}'\mathbf{y}$ of the responses where $\mathbf{a}'\mathbf{a} = 1$. The coefficients 
or ‘component weights’ for the first principal component are determined 
to maximize the variance of the principal component. The coefficients of each 
subsequent principal component are determined to maximize the variance of 
the principal component subject to the constraint that it is uncorrelated with 
the previous components. The covariance matrix of the responses is therefore 
decomposed as 

$$\mathrm{Cov}(\mathbf{y}) = \mathbf{A}\boldsymbol{\Psi}^*\mathbf{A}',$$ 

where the columns of $\mathbf{A}$ are the coefficient vectors $\mathbf{a}$ and $\boldsymbol{\Psi}^*$ is the diagonal 
covariance matrix of the principal components. The columns of $\mathbf{A}$ are the eigenvectors 
of $\mathrm{Cov}(\mathbf{y})$ and the diagonal elements of $\boldsymbol{\Psi}^*$ are the corresponding eigenvalues. 
Important differences from the factor structure in (3.26) are that there is no 
unique factor covariance matrix $\boldsymbol{\Theta}$ and that the components cannot be correlated. 
Principal component analysis is a data reduction technique since the 
first few principal components may capture the main features of the original 
data in terms of the ‘percentage of variance explained’. Unlike factor analysis, 
there is no statistical model underlying principal component analysis; 
it is merely a transformation of the data. An advantage of factor analysis as 
compared to principal component analysis is that there is a simple relationship 
between estimates based on different scalings of the responses (e.g. Krane 
and McDonald, 1978). 
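
The decomposition of the covariance matrix into principal components can be sketched with an eigendecomposition. The example assumes NumPy, where `numpy.linalg.eigh` returns the eigenvectors as columns; the covariance matrix `S` is a hypothetical example.

```python
import numpy as np

# Hypothetical covariance matrix of three responses.
S = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh returns eigenvalues in ascending order; reorder so that the first
# component has the largest variance.
vals, vecs = np.linalg.eigh(S)
order = np.argsort(vals)[::-1]
A = vecs[:, order]        # component weights, one unit-length column each
psi_star = vals[order]    # variances of the principal components

reconstructed = A @ np.diag(psi_star) @ A.T   # should equal S
component_cov = A.T @ S @ A                   # should be diagonal
explained = psi_star / psi_star.sum()         # share of total variance
print(explained)
```

The off-diagonal elements of `component_cov` are zero up to rounding, reflecting that the components are uncorrelated, and `explained` gives the ‘percentage of variance explained’ by each component.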

An exploratory factor analysis usually proceeds through the following rather 
ad hoc steps. First the number of factors is determined based on a principal 
component analysis of the correlation matrix. The number of factors is typically 
chosen to be equal to the number of eigenvalues that are larger than 
one, the so-called Kaiser-Guttman criterion. Sometimes, however, a so-called 
scree-plot is used where the eigenvalues are plotted against their rank and the 
number of factors is indicated by the ‘elbow’ of the curve (Cattell, 1966). Second, 
a factor model with the chosen number of factors is estimated. There are a 
number of methods for this including maximum likelihood (e.g. Bartholomew 
and Knott, 1999). However, some software packages actually use the principal 
components as factors and the component weights as factor loadings. 

It is typically difficult to ascribe meaning to the factors at this stage since 
most items will have nonnegligible loadings on most factors. Third, a transformation 
matrix $\mathbf{R}$ is therefore used to produce more interpretable loadings 
according to some criteria such as loadings being either ‘small’ or ‘large’ (e.g. 
Harman, 1976; McDonald, 1985). If uncorrelated factors are required, $\mathbf{R}$ must 
be orthogonal and the transformation is a rotation; otherwise it is referred 
to by the misnomer ‘oblique rotation’. The fourth and final step is to retain 
only the ‘salient’ loadings, interpreting as zero any loadings falling below an 
arbitrary threshold, typically 0.3 or 0.4. 
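
The first and last of these steps are simple enough to sketch in code. The example assumes NumPy; the correlation matrix, the loading matrix and the threshold of 0.3 are hypothetical, and the sketch is only meant to show the mechanics of the eigenvalue rule and of discarding non-salient loadings.

```python
import numpy as np

# Hypothetical correlation matrix: two blocks of three items each, with
# correlation 0.6 within blocks and 0 between blocks.
block = 0.6 * np.ones((3, 3)) + 0.4 * np.eye(3)
corr = np.kron(np.eye(2), block)

# Step 1: count the eigenvalues of the correlation matrix larger than one.
eigenvalues = np.linalg.eigvalsh(corr)
n_factors = int(np.sum(eigenvalues > 1.0))

# Step 4: keep only 'salient' loadings, treating loadings whose absolute
# value falls below the threshold as zero.
loadings = np.array([[0.80, 0.10],
                     [0.25, 0.70]])
salient = np.where(np.abs(loadings) >= 0.3, loadings, 0.0)
print(n_factors)
print(salient)
```

Each block contributes one eigenvalue of $1 + 2 \times 0.6 = 2.2$ and two eigenvalues of 0.4, so the rule suggests two factors here.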

The tenability of this final model is never assessed and it may not even fit 
the data used for exploration. For this reason, and since modeling is purely 
exploratory, it is hardly surprising that such models are usually falsified by 
confirmatory factor analyses in other samples (e.g. Vassend and Skrondal, 
1995, 1997, 1999, 2004). We return to the philosophical differences between 
the exploratory and confirmatory approaches in Section 8.2.2. 

A confirmatory factor model equivalent to the traditional exploratory factor 
model with $M$ factors (in the sense to be defined in Section 5.3) can be 
specified by judiciously imposing $M^2$ restrictions as in the exploratory model. 
The ‘reference solution’ of Jöreskog (1969, 1971a) is obtained in the following 
way: 

• Fix the factor variances by imposing the $M$ restrictions $\psi_{11} = \psi_{22} = \cdots = 
\psi_{MM} = 1$ (there are no restrictions on the correlations) 

• Pick an ‘anchor’ item $i_m$ for each factor $m$, preferably one with a large loading 
for the factor and small loadings for the other factors. Impose $\lambda_{i_m k} = 0$ 
for all other factors $\eta_k$, $k \neq m$ 

This is useful since exploratory factor analysis can then be performed via 
confirmatory factor analysis, taking full advantage of the facilities for statistical 
inference within the latter approach. We use this approach for investigating 
the dimensionality of political efficacy in Section 10.3. 


3.3.4 Item response models 

The unidimensional factor model can be extended to dichotomous and ordinal 
responses using two different approaches (e.g. Takane and de Leeuw, 1987). 
Factor analysts typically use a latent response formulation as described in 
Section 2.4. In this case latent responses $y_{ij}^*$ simply take the place of the 
observed responses $y_{ij}$ in the conventional factor model. In item response 
theory (IRT), on the other hand, the generalized linear model formulation is 
typically used. Here the conditional probability of a particular response given 
the latent trait (or factor), the so-called item characteristic curve, is specified 
by a link function, typically a logit or probit. 

In this formulation, the single factor model is known as a two-parameter item 
response model since there are two parameters associated with each item, an 
intercept and a factor loading. The classical application of these models is in 
ability testing, where the items $i$ represent questions or problems in a test and 
the answers are scored as right (1) or wrong (0). In this setting $\eta_j$ represents 
the ability of person $j$ and the model is typically parameterized as 

$$\nu_{ij} = a_i(\eta_j - b_i).$$ 

Here, $b_i$ can be interpreted as the item difficulty, giving a 50% chance of a 
correct answer when ability equals difficulty, whereas $a_i$ is an item discrimination 
parameter (or factor loading) determining how well the item discriminates 
between subjects with different abilities. Figure 3.4 shows examples of 
item characteristic curves for a two-parameter logistic (2-PL) item response 
model (Birnbaum, 1968). The solid and dashed curves are for items with the 
same difficulty $b$ but different discrimination parameters (slopes) $a$, whereas 
the solid and dotted curves are for items with the same discrimination parameter 
$a$ but different difficulties (horizontal shifts) $b$. We will estimate a 
two-parameter item response model for items testing arithmetic reasoning in 



In the two-parameter model the probability of answering correctly tends to 
zero as ability goes to minus infinity. However, this is unrealistic if multiple 
choice formats are used, since guessing would produce a nonzero probability of 
answering correctly even for people with dismal abilities. An extra parameter 
is therefore sometimes introduced into the two-parameter model leading to 
the three-parameter logistic item response model (Birnbaum, 1968); see Section 9.4 
for an example. Unfortunately, huge samples appear to be required to 
obtain reliable estimates of this model (e.g. Wainer and Thissen, 1982). We 
expect this problem to be exacerbated for the four-parameter model suggested 
by McDonald (1967). This model introduces an extra parameter to capture 
that even extremely able examinees commit errors. 

The factor loadings $a_i$ in the two-parameter item response model are often 
constrained equal, and without loss of generality set to 1, giving 

$$\nu_{ij} = \eta_j - b_i,$$ 

a one-parameter model. In the logistic case the one-parameter model is often 
abbreviated as 1-PL. Note that a one-parameter item response model is just 
a random intercept model for dichotomous items without covariates. It is evident 
from Figure 3.5 that the one-parameter model has a property sometimes 
called ‘double-monotonicity’: for each ability, performance decreases with difficulty 
and for each difficulty, performance increases with ability, i.e. items 
and subjects are strictly ordered. 

Figure 3.5 Item characteristic curves for one-parameter logistic model with different 
$b$ (horizontal axis: latent variable $\eta_j$) 

If the $\eta_j$ in the one-parameter model are taken as fixed parameters and a 
logit link is used, the famous Rasch model (Rasch, 1960) is obtained. This 
model has a number of attractive theoretical properties (e.g. Fischer, 1995). 
For instance, the Rasch model is equivalent to the requirement that the unweighted 
sum-score of the responses is a sufficient statistic for $\eta_j$ given the 
item-parameters $b_i$. This implies that conditional maximum likelihood estimation 
can be used for the item-parameters (see Section 6.10.2). Furthermore, 
the Rasch model is equivalent to a particular notion of generalizability of 
scientific statements, dubbed ‘specific objectivity’ by Rasch (1967). Broadly 
speaking, specific objectivity means that comparison of the ability of two subjects 
should only depend on the ability of these subjects (and not the ability 
of others) and that the comparison should yield the same result whatever item 
the comparison is based on. 

As a development of the Rasch model, Fischer (1977, 1995) suggested the 
linear logistic test model where the item parameters in the Rasch model are 
structured in terms of item-specific covariates, 

$$b_i = \mathbf{x}_i'\boldsymbol{\beta}.$$ 
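
The item characteristic curves of the one-, two- and three-parameter logistic models discussed above can be sketched with a single function. The example assumes NumPy and a logit link; the function name `icc` and the name `c` for the guessing parameter are hypothetical, and the three-parameter form used here, $c + (1 - c)$ times the two-parameter curve, is the standard one rather than a formula given in this section.

```python
import numpy as np

def icc(eta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct answer for ability eta.

    a: discrimination, b: difficulty, c: lower asymptote due to guessing.
    a = 1 and c = 0 give the one-parameter logistic model, c = 0 the
    two-parameter model."""
    eta = np.asarray(eta, dtype=float)
    return c + (1.0 - c) / (1.0 + np.exp(-a * (eta - b)))

abilities = np.array([-2.0, 0.0, 0.5, 2.0])
p_2pl = icc(abilities, a=2.0, b=0.5)            # 50% chance where eta equals b
p_3pl = icc(abilities, a=2.0, b=0.5, c=0.25)    # never drops below c

# One-parameter curves: for a fixed ability, harder items are answered
# correctly less often.
p_easy = icc(abilities, b=-1.0)
p_hard = icc(abilities, b=1.0)
print(p_2pl)
print(p_3pl)
```

With equal discrimination the curves for different difficulties are horizontal shifts of each other and never cross, which is the double-monotonicity property.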

The nominal response model (e.g. Rasch, 1961; Bock, 1972; Samejima, 1972; 
Andersen, 1973) handles polytomous responses such as answers $a_s$, $s = 1,\ldots,S$, 
to multiple-choice questions. It is a multinomial logit model 

$$\Pr(y_{ij} = a_s \mid \eta_j) = \frac{\exp(\nu_{ij}^s)}{\sum_{t=1}^{S}\exp(\nu_{ij}^t)},$$ 

where 

$$\nu_{ij}^s = \beta_i^s + \lambda_i^s\eta_j.$$ 

Identification is obtained, for instance, by setting $\beta_i^1 = 0$, $\lambda_i^1 = 0$ and 
$\mathrm{Var}(\eta_j) = 1$. 
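
The category probabilities of the nominal response model for one item can be computed as follows. This is a sketch assuming NumPy; the function name `nominal_probs` and the intercepts and loadings are hypothetical, with the first category used as the reference (intercept and loading set to zero).

```python
import numpy as np

def nominal_probs(eta, beta, lam):
    """Multinomial logit probabilities for one item with linear predictors
    nu^s = beta^s + lam^s * eta, one per response category s."""
    nu = np.asarray(beta, dtype=float) + np.asarray(lam, dtype=float) * eta
    nu = nu - nu.max()  # subtract the maximum for numerical stability
    w = np.exp(nu)
    return w / w.sum()

# Hypothetical item with three categories; category 1 is the reference.
beta = [0.0, np.log(2.0), np.log(3.0)]
lam = [0.0, 1.0, 2.0]
p_at_zero = nominal_probs(0.0, beta, lam)   # depends on the intercepts only
p_at_high = nominal_probs(3.0, beta, lam)   # larger loadings dominate
print(p_at_zero)
print(p_at_high)
```

At $\eta_j = 0$ the probabilities are proportional to $\exp(\beta_i^s)$, here 1 : 2 : 3, and as the latent trait increases the category with the largest loading becomes most likely.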

Famous models for ordinal responses in item response theory, including the 
partial credit model (Masters, 1982) and rating scale model (Andrich, 1978), 
can be obtained by imposing restrictions in the nominal response model (e.g. 
Thissen and Steinberg, 1986). In the partial credit model the linear predictor 
is given as 

$$\nu_{ij}^s = \beta_i^s + s\,\eta_j,$$ 

where equidistant category scores $s$ are substituted for the unknown factor 
loadings of the nominal response model. The difference between the intercepts 
$\beta_i^s$ and $\beta_i^{s-1}$ of adjacent categories is 
sometimes called the ‘item step difficulty’ associated with category $s$. 

The rating scale model is a special case of the partial credit model where 


The intercept is split into item-specific components $\delta_i$ and category-specific 
components $\tau_s$. 


3.4 Latent class models 


In latent class models the units are assumed to belong to one of C discrete 
classes $c = 1,\ldots,C$ where class membership is unknown. Thus, the classes 
can be viewed as the categories of a categorical latent variable. The (prior) 
probability that a unit $j$ is in class $c$, $\pi_{jc}$, is a model parameter. 

Latent class models are traditionally used when dichotomous or polytomous 
responses $i$ are observed on each unit $j$. If unit $j$ is in class $c$, the conditional 
response probability that item $i$ takes on the value $a_s$, $s = 1,\ldots,S$, is modeled 
as a multinomial logit 

$$\Pr(y_{ij} = a_s \mid c) = \frac{\exp(\nu_{ijc}^s)}{\sum_{t=1}^{S}\exp(\nu_{ijc}^t)}.$$ 


The responses to the items are assumed to be conditionally independent given 
membership in a given latent class. This is analogous to item response and 
factor models where the responses are conditionally independent given a continuous 
latent trait. 

The unconditional response probabilities become finite mixtures 

$$\Pr(y_{ij} = a_s) = \sum_{c=1}^{C}\pi_{jc}\Pr(y_{ij} = a_s \mid c),$$ 

and the probability of a response pattern $\mathbf{y}_j = (y_{1j},\ldots,y_{Ij})'$ is 

$$\Pr(\mathbf{y}_j) = \sum_{c=1}^{C}\pi_{jc}\prod_{i=1}^{I}\Pr(y_{ij} \mid c).$$ 
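
For dichotomous items the response pattern probability is a short computation. The sketch below assumes NumPy; the function name `pattern_prob`, the two classes and the conditional response probabilities are hypothetical values used to illustrate the mixture over classes and the product over conditionally independent items.

```python
import itertools
import numpy as np

def pattern_prob(y, pi, p):
    """Probability of a 0/1 response pattern y in a latent class model.

    pi[c] is the prior probability of class c and p[c, i] the conditional
    probability that item i equals 1 given class c."""
    y = np.asarray(y)
    pi = np.asarray(pi, dtype=float)
    p = np.asarray(p, dtype=float)
    # Conditional independence: multiply the item probabilities within class.
    cond = np.prod(np.where(y == 1, p, 1.0 - p), axis=1)
    # Finite mixture: weight the class-specific probabilities by pi.
    return float(pi @ cond)

pi = [0.5, 0.5]
p = [[0.9, 0.8],   # class 1: both items likely to be 1
     [0.1, 0.2]]   # class 2: both items likely to be 0
prob_11 = pattern_prob([1, 1], pi, p)
total = sum(pattern_prob(list(y), pi, p)
            for y in itertools.product([0, 1], repeat=2))
print(prob_11, total)
```

Here the pattern (1, 1) has probability $0.5 \times 0.72 + 0.5 \times 0.02 = 0.37$, and the probabilities of all four patterns sum to one.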

In the conventional exploratory latent class model (e.g. McCutcheon, 1987), 
the linear predictor for item $i$ and category $s$ is a free parameter, 

$$\nu_{ijc}^s = \beta_{ic}^s,$$ 

with the constraint $\beta_{ic}^1 = 0$. In Section 9.3 we estimate an exploratory latent 
class model for four dichotomous diagnostic tests for myocardial infarction. 
The two classes represent patients with and without myocardial infarction and 
the parameters in the linear predictor relate to the sensitivities and specificities 
of the tests. 

In the case of ordinal responses, $\beta_{ic}^{s}$ is often structured as
$\beta_{ic}^{s} = b_s \beta_{ic}$ for some scores $b_s$, giving the adjacent
category logit if $b_s = s$. Other parameterizations are also possible, see
Section 2.3.4. Confirmatory latent class models impose restrictions on the
parameters, typically setting some conditional response probabilities equal
across latent classes. Latent class models can be formulated as log-linear
models for contingency tables where one of the categorical variables, the
latent class variable, is unobserved (e.g. Goodman, 1974; Haberman, 1979).

Finite mixture models have the same structure as latent class models. For
continuous responses and counts, these models are often used when there is
only one response per unit to obtain a flexible model for the probability
distribution. In the case of multivariate continuous responses, the conditional
response distribution is often specified as multivariate normal, thus relaxing
the usual conditional independence assumption. Such model-based cluster
analysis is discussed in Banfield and Raftery (1993) and Bensmail et al. (1997).

3.5 Structural equation models with latent variables 

Measurement and factor models are important in their own right but also
as building blocks in structural equation models where the relations among
latent variables are modeled. These relationships are often of main scientific
interest whereas the relationships between the observed items and the latent
variables are of secondary interest. An important advantage of modeling
the relationships among latent variables directly is that detrimental effects
of measurement error, such as regression dilution (e.g. Rosner et al., 1990),
may potentially be corrected (see for example Fuller, 1987 and Carroll et al.,
1995a).
Consider the simple 'errors in variables' problem where a single covariate
is measured with error according to a conventional measurement model

$$x_j = \xi_j + \delta_j, \tag{3.28}$$

with reliability $\rho < 1$. We wish to study the regression of $y_j$ on the
true covariate (the platonic score) $\xi_j$,

$$y_j = \gamma_0 + \gamma_1 \xi_j + \zeta_j, \tag{3.29}$$

and are particularly interested in the regression parameter $\gamma_1$. However,
using conventional regression, we have to rely on the regression on the
observed but error-prone covariate $x_j$ instead,

$$y_j = \gamma_0^{*} + \gamma_1^{*} x_j + \zeta_j^{*}.$$

In this simple case it can be shown that the consequence of ignoring
measurement error is that the estimated regression parameter is attenuated
relative to the true regression parameter,

$$E(\hat{\gamma}_1^{*}) = \gamma_1 \rho.$$

When there are several covariates measured with error the consequences are 
less clear cut. 
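The attenuation result for a single error-prone covariate can be checked by simulation. The sketch below is only illustrative: the parameter values, the sample size and the helper `ols_slope` are assumptions made here, with the reliability taken to be the ratio of the true-score variance to the variance of the observed covariate.

```python
import random


def ols_slope(x, y):
    """Slope of the simple least squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


rng = random.Random(12345)  # fixed seed so the run is reproducible
n = 20000
gamma0, gamma1 = 1.0, 2.0            # hypothetical true regression parameters
var_true, var_error = 1.0, 1.0       # variances of true covariate and measurement error
rho = var_true / (var_true + var_error)  # reliability of the observed covariate

xi = [rng.gauss(0.0, var_true ** 0.5) for _ in range(n)]   # true covariate
x = [t + rng.gauss(0.0, var_error ** 0.5) for t in xi]     # error-prone measurement
y = [gamma0 + gamma1 * t + rng.gauss(0.0, 1.0) for t in xi]

slope_true = ols_slope(xi, y)   # regression on the true covariate
slope_naive = ols_slope(x, y)   # regression on the observed covariate

print(round(slope_true, 2), round(slope_naive, 2), gamma1 * rho)
```

With these settings the slope from the regression on the true covariate should be close to the true parameter, whereas the naive slope should be close to the true parameter multiplied by the reliability.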

Importantly, $\gamma_1$ can be consistently estimated by jointly modeling (3.28)
and (3.29), a simple example of a structural equation model. It should also
be noted that the attenuation problem does not arise under the Berkson
measurement model or when only the response variable is measured with error.

We now describe traditional structural equation modeling with latent
variables, also often referred to as covariance structure analysis. As the
latter term suggests, interest focuses on the covariance structure whereas the
mean structure is typically eliminated by subtracting the mean from each
variable. Having defined common factor models, a structural model specifying
relations among the latent variables can be constructed. In this structural
model, there could be both latent dependent variables and latent explanatory
variables. As an example, consider a structural equation model for two latent
dependent variables $\eta_{1j}$, $\eta_{2j}$ and two latent explanatory variables
$\xi_{1j}$, $\xi_{2j}$. The measurement model for the dependent variables is
specified as an independent clusters model where the latent dependent
variables are each measured by three nonoverlapping items as in (3.27),
written in vector notation as

$$\mathbf{y}_j = \boldsymbol{\Lambda}_y \boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j.$$

Similarly, the measurement model for the explanatory variables can be written
as

$$\mathbf{x}_j = \boldsymbol{\Lambda}_x \boldsymbol{\xi}_j + \boldsymbol{\delta}_j.$$

Note that we have omitted the constants assuming that both $\mathbf{y}_j$ and
$\mathbf{x}_j$ are mean-centered.

We now specify a structural model letting both latent dependent variables 
be regressed on both latent explanatory variables. In addition, one latent 
The structural model becomes


which can be written compactly as

$$\boldsymbol{\eta}_j = \mathbf{B}\boldsymbol{\eta}_j + \boldsymbol{\Gamma}\boldsymbol{\xi}_j + \boldsymbol{\zeta}_j. \tag{3.32}$$

If there is an observed covariate, it can be included by introducing an
artificial 'latent' explanatory variable, with factor loading equal to one for
the covariate and zero for all other observed variables and setting the unique
factor variance to zero. A disadvantage of this approach is that the
explanatory variables are in effect treated as response variables so that the
assumption of multivariate normality is often invoked. This is obviously
unreasonable for many continuous covariates and even more so for dichotomous
covariates such as gender. Some frameworks, for instance that of Muthén (1984),
include an extra term $\boldsymbol{\Gamma}\mathbf{x}_{1j}$ for regressions of
latent variables on observed covariates:

$$\boldsymbol{\eta}_j = \boldsymbol{\alpha} + \mathbf{B}\boldsymbol{\eta}_j + \boldsymbol{\Gamma}\mathbf{x}_{1j} + \boldsymbol{\zeta}_j, \tag{3.33}$$

where $\boldsymbol{\alpha}$ is an intercept vector. Muthén specifies the model
conditional on the covariates so that distributional assumptions are not
required for the covariates. In the measurement model, the additional term
$\mathbf{K}\mathbf{x}_{2j}$ is included by Muthén and Muthén (1998) to represent
regressions of observed responses on observed covariates,

$$\mathbf{y}_j = \boldsymbol{\nu} + \boldsymbol{\Lambda}\boldsymbol{\eta}_j + \mathbf{K}\mathbf{x}_{2j} + \boldsymbol{\epsilon}_j, \tag{3.34}$$

where $\boldsymbol{\nu}$ is a vector of intercepts (often $\mathbf{x}_{1j} = \mathbf{x}_{2j}$).

A popular structural equation model with observed covariates is the
Multiple-Indicator Multiple-Cause (MIMIC) model, a one-factor model where the
factor is measured by multiple indicators and regressed on several observed
covariates or 'causes' (e.g. Zellner, 1970; Hauser and Goldberger, 1971;
Goldberger, 1972). Here the structural model is simply

$$\eta_j = \alpha + \boldsymbol{\gamma}'\mathbf{x}_{1j} + \zeta_j.$$

A path diagram of a MIMIC model with three indicators and three covariates
is shown in Figure 3.7.

Robins and West (1977) considered a MIMIC model for handling measurement
error in the estimation of home value. Three measures of home value
were used in the measurement part of the model: 'appraised value by a private
firm', 'estimated value by owner' and 'assessed value by county for tax
purposes'. In the structural part home value was regressed on twelve property
characteristics, including 'construction grade', 'type of garage' and 'finished
area'. For MIMIC models with several factors we refer to Robinson (1974).

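Figure 3.7 Path diagram of MIMIC model

Returning to the general model in equation (3.33), the structural model can
be solved for the latent variables giving

$$\boldsymbol{\eta}_j = (\mathbf{I} - \mathbf{B})^{-1}[\boldsymbol{\alpha} + \boldsymbol{\Gamma}\mathbf{x}_{1j} + \boldsymbol{\zeta}_j]. \tag{3.35}$$

Substituting $\boldsymbol{\eta}_j$ into equation (3.34) gives the reduced form

$$\mathbf{y}_j = \boldsymbol{\nu} + \boldsymbol{\Lambda}(\mathbf{I} - \mathbf{B})^{-1}[\boldsymbol{\alpha} + \boldsymbol{\Gamma}\mathbf{x}_{1j} + \boldsymbol{\zeta}_j] + \mathbf{K}\mathbf{x}_{2j} + \boldsymbol{\epsilon}_j.$$

The conditional expectation structure given $\mathbf{x}_{1j}$ and $\mathbf{x}_{2j}$ becomes

$$E(\mathbf{y}_j \mid \mathbf{x}_{1j}, \mathbf{x}_{2j}) = \boldsymbol{\nu} + \boldsymbol{\Lambda}(\mathbf{I} - \mathbf{B})^{-1}[\boldsymbol{\alpha} + \boldsymbol{\Gamma}\mathbf{x}_{1j}] + \mathbf{K}\mathbf{x}_{2j},$$

and the conditional covariance structure becomes

$$\boldsymbol{\Omega} \equiv \mathrm{Cov}(\mathbf{y}_j \mid \mathbf{x}_{1j}, \mathbf{x}_{2j}) = \boldsymbol{\Lambda}(\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\Psi}(\mathbf{I} - \mathbf{B})^{-1\prime}\boldsymbol{\Lambda}' + \boldsymbol{\Theta},$$

where $\boldsymbol{\Psi}$ is the covariance matrix of $\boldsymbol{\zeta}_j$ and
$\boldsymbol{\Theta}$ the covariance matrix of $\boldsymbol{\epsilon}_j$.
See Chapter 5 for further examples of traditional structural equation models
with latent variables.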
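To make the conditional covariance structure concrete, the sketch below evaluates it for a small hypothetical model with two latent variables, each measured by two items, where the second latent variable is regressed on the first. The helper functions and all parameter values are assumptions introduced for illustration; only the matrix expression itself comes from the text.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix (enough for two latent variables)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def implied_covariance(lam, b_mat, psi, theta):
    """Lambda (I - B)^{-1} Psi (I - B)^{-1}' Lambda' + Theta."""
    i_minus_b = [[(1.0 if r == c else 0.0) - b_mat[r][c] for c in range(2)]
                 for r in range(2)]
    a = matmul(lam, inverse_2x2(i_minus_b))      # Lambda (I - B)^{-1}
    common = matmul(matmul(a, psi), transpose(a))
    size = len(theta)
    return [[common[r][c] + theta[r][c] for c in range(size)]
            for r in range(size)]


# Hypothetical parameter values (independent clusters, first loadings fixed at 1).
lam = [[1.0, 0.0], [0.8, 0.0], [0.0, 1.0], [0.0, 1.2]]
b_mat = [[0.0, 0.0], [0.5, 0.0]]    # second latent variable regressed on the first
psi = [[1.0, 0.0], [0.0, 0.5]]      # covariance matrix of the disturbances
unique = [0.3, 0.3, 0.4, 0.4]       # unique factor variances
theta = [[unique[r] if r == c else 0.0 for c in range(4)] for r in range(4)]

omega = implied_covariance(lam, b_mat, psi, theta)
for row in omega:
    print([round(v, 3) for v in row])
```

In this example the variance of the second latent variable is 0.5 squared times 1.0 plus 0.5, that is 0.75, so the covariance between items 2 and 4 is 0.8 times 1.2 times the latent covariance 0.5.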

Structural equation models are often given causal interpretations. For
instance, Goldberger (1972) talks about 'causal links' in this context, and the
methodology was previously often called 'causal modeling' (e.g. Bentler, 1978,
1980; James et al., 1982). The causal parlance attached to simultaneous
equations is undoubtedly a major reason both for the attractiveness of these
kinds of models among social scientists and the scepticism from many
statisticians. Causal interpretations should of course be made with extreme
caution in the context of observational designs as has been stressed by
Guttmann (1977), Cliff (1983), Freedman (1985, 1986, 1992), Holland (1988) and
Sobel (1995), among others.

The specification of structural equation models and drawing of corresponding
path diagrams is nevertheless indispensable for reasoning about how causal
processes operate. For instance, in epidemiology a simple path diagram will
often reveal which variables are best treated as confounders (and 'controlled'
for) and which variables should be treated as intermediate variables in the
causal pathway (and not 'controlled' for). It is important to note that the
structural equation models discussed here are closely related to graphical
models and models for potential outcomes (see Greenland and Brumback, 2002).
Pearl (2000) provides a lucid discussion of modern 'causal modeling'.

3.6 Longitudinal models 

Longitudinal data, often called repeated measurements in medicine, panel data
in the social sciences and cross-sectional time-series data in economics, arise
when units provide responses on multiple occasions. Two important features of
longitudinal data are the clustering of responses within units and the
chronological ordering of responses. A typical problem is to investigate
predictors of the overall levels of the responses as well as predictors of
changes in the responses over time. Longitudinal designs allow the separation
of cross-sectional and longitudinal effects, as demonstrated in Section 1.4.

In addition to accommodating the mean structure, longitudinal models must
also allow for dependence among responses on the same unit. The reasons
for this dependence include unobserved heterogeneity between units inducing
within-unit dependence (as in two-level models) as well as unobserved
time-varying influences inducing greater dependence between responses occurring
closer together in time. Both types of unobserved heterogeneity can be modeled
explicitly in an attempt to explain the conditional covariance structure
(given the covariates).

In the following subsections we distinguish between two types of longitudinal
data: data with balanced or unbalanced occasions. The occasions are balanced
if all units are measured at the same sets of time points $t_i$,
$i = 1, \ldots, I$, and unbalanced if different sets of time points, $t_{ij}$,
$i = 1, \ldots, n_j$, are used for different units. Missing data are possible in
either case. If different units are measured at different sets of time points,
but at each time point there are measurements from a considerable number of
units, the occasions can be treated as balanced with missing data.

In the case of either balanced or unbalanced occasions, longitudinal data
can be thought of as two-level data with occasions $i$ at level 1 and units $j$
at level 2. In the case of balanced occasions, the data can also be viewed as
single-level multivariate data where responses at different occasions are
treated as different variables. In this case models for the mean and covariance
structure can include occasion-specific parameters, for instance
occasion-specific residual variances. However, in the case of unbalanced
occasions the mean and covariance structures are typically modeled as a
function of the time associated with the occasions or as a function of
time-varying covariates.


3.6.1 Models with unit-specific effects 

The primary reason for collecting information at multiple occasions for each
unit is that it allows investigation of change within individual units;
unit-specific (constant over time) effects can be controlled for, and we can
investigate between-unit variability in the nature and degree of change or
growth over time. Perhaps the most natural approach to longitudinal modeling is
therefore to model individual growth trajectories using a combination of common
(across units) fixed effects to summarize the average features of the
trajectories and unit-specific fixed or random effects to represent variability
between units. The random effects then induce and hence explain the conditional
covariance structure.

Fixed effects models 

Consider the response $y_{ij}$ of unit $j$ on occasion $i$. A simple linear
fixed effects model for longitudinal data is of the form

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \alpha_j + \epsilon_{ij}, \tag{3.36}$$

where $\mathbf{x}_{ij}$ are time-varying covariates, sometimes including a time
variable $t_{ij}$, with regression parameters $\boldsymbol{\beta}$, $\alpha_j$
are unit-specific intercepts or 'fixed effects' and $\epsilon_{ij}$ are
identically and independently normally distributed residuals with
$E(\epsilon_{ij} \mid \mathbf{x}_{ij}) = 0$. The fixed effects $\alpha_j$
represent unit-specific effects that, if ignored, could lead to confounding and
induce dependence among the residuals producing bias.

As a consequence of including $\alpha_j$ in the model, the effects
$\boldsymbol{\beta}$ are interpretable as within-unit effects. This can be seen
by considering the cluster means of model (3.36),

$$\bar{y}_j = \bar{\mathbf{x}}_j'\boldsymbol{\beta} + \alpha_j + \bar{\epsilon}_j, \tag{3.37}$$

where the responses are the means for the units (over occasions). With a
separate parameter $\alpha_j$ for each response, giving a saturated model, the
responses do not provide any information on $\boldsymbol{\beta}$. Subtracting
(3.37) from (3.36), we obtain the within-unit regression model

$$y_{ij} - \bar{y}_j = (\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)'\boldsymbol{\beta} + \epsilon_{ij} - \bar{\epsilon}_j, \tag{3.38}$$

which eliminates $\alpha_j$.

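Estimates of $\boldsymbol{\beta}$ and $\alpha_j$ can be obtained from ordinary
least squares (OLS) estimation of the fixed effects model (3.36), which
simultaneously produces the estimates $\hat{\boldsymbol{\beta}}_{FE}$ and
$\hat{\alpha}_j$. Alternatively, and equivalently, OLS estimation of
$\boldsymbol{\beta}$ may be based on the within-unit model (3.38) producing

$$\hat{\boldsymbol{\beta}}_W = \mathbf{W}_{xx}^{-1}\mathbf{W}_{xy},$$

where $\mathbf{W}_{xx} = \sum_j \sum_i (\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)(\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)'$
and $\mathbf{W}_{xy} = \sum_j \sum_i (\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)(y_{ij} - \bar{y}_j)$.

The unit-specific intercepts are subsequently estimated as

$$\hat{\alpha}_j = \bar{y}_j - \bar{\mathbf{x}}_j'\hat{\boldsymbol{\beta}}_W.$$

If it is assumed that $\epsilon_{ij}$ is normally distributed, it can also be
shown that $\hat{\boldsymbol{\beta}}_W$ is obtained from conditional maximum
likelihood estimation given $\bar{y}_j$. In practice, where the number of
occasions $n$ is fixed and of a moderate magnitude,
$\hat{\boldsymbol{\beta}}_W$ is a best linear unbiased estimator and consistent,
whereas $\hat{\alpha}_j$ is inconsistent as $N \rightarrow \infty$.

Display 3.3 Common dependence structures for longitudinal data.

A. Random intercept structure:

$$\boldsymbol{\Omega} = \psi\,\mathbf{1}\mathbf{1}' + \theta\,\mathbf{I}$$

B. Random coefficient structure:

$$\boldsymbol{\Omega}_j = \mathbf{Z}_j\boldsymbol{\Psi}\mathbf{Z}_j' + \theta\,\mathbf{I}_{n_j}$$

ii. Multidimensional factor:

$$\boldsymbol{\Omega} = \boldsymbol{\Lambda}\boldsymbol{\Psi}\boldsymbol{\Lambda}' + \boldsymbol{\Theta}$$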

In univariate repeated measures ANOVA, or ANOVA for a balanced split
plot design (e.g. Hand and Crowder, 1996), within-unit effects can be estimated
using the above model whereas between-unit effects can be estimated by
specifying an ordinary linear model for the mean responses $\bar{y}_j$. The
first analysis corresponds to a partitioning of the within-unit sums of squares
whereas the second uses the between-unit sums of squares.

A problem with the fixed effects model is that regression parameters for
time-constant covariates, such as gender or treatment group, where
$\mathbf{x}_{ij} = \mathbf{x}_j$, are not identified; see (3.38). Conditional
maximum likelihood estimation of fixed effects models is also discussed in
Section 6.10.2.
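The within-unit estimator and the identification problem for time-constant covariates can be illustrated with a single covariate. The sketch below uses hypothetical, noise-free data in which the unit-specific intercepts are strongly related to the covariate; the function `within_sums` and all numbers are assumptions made for illustration only.

```python
def within_sums(x, y):
    """Within-unit sum of squares W_xx and cross-products W_xy for one covariate.

    x[j][i], y[j][i]: covariate and response for unit j at occasion i.
    """
    w_xx = w_xy = 0.0
    for xj, yj in zip(x, y):
        x_mean = sum(xj) / len(xj)
        y_mean = sum(yj) / len(yj)
        w_xx += sum((a - x_mean) ** 2 for a in xj)
        w_xy += sum((a - x_mean) * (b - y_mean) for a, b in zip(xj, yj))
    return w_xx, w_xy


def pooled_slope(x, y):
    """OLS slope ignoring the unit-specific effects."""
    flat_x = [a for xj in x for a in xj]
    flat_y = [b for yj in y for b in yj]
    mx = sum(flat_x) / len(flat_x)
    my = sum(flat_y) / len(flat_y)
    s_xy = sum((a - mx) * (b - my) for a, b in zip(flat_x, flat_y))
    s_xx = sum((a - mx) ** 2 for a in flat_x)
    return s_xy / s_xx


beta = 0.5                        # hypothetical within-unit effect
alpha = [10.0, 20.0, 30.0]        # unit-specific intercepts, related to x
x = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0], [5.0, 6.0, 7.0, 8.0]]
y = [[alpha[j] + beta * v for v in x[j]] for j in range(3)]

w_xx, w_xy = within_sums(x, y)
beta_within = w_xy / w_xx         # recovers beta because alpha_j is differenced out
beta_pooled = pooled_slope(x, y)  # confounded by the omitted alpha_j

# A time-constant covariate (e.g. a 0/1 indicator) has no within-unit variation.
indicator = [[0.0] * 4, [1.0] * 4, [1.0] * 4]
w_xx_indicator, _ = within_sums(indicator, y)

print(beta_within, round(beta_pooled, 3), w_xx_indicator)
```

The zero within-unit sum of squares for the indicator shows why its coefficient cannot be estimated from the within-unit model.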

Random intercept models 

Instead of treating the unit-specific effects as fixed, we can assume that the
effects are realizations of a random variable $\zeta_j$,

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \epsilon_{ij},$$

where $\zeta_j$ and $\epsilon_{ij}$ are independently distributed,
$\zeta_j \sim N(0, \psi)$ and $\epsilon_{ij} \sim N(0, \theta)$.
This random intercept model is often called a 'one-way error component
model' in econometrics.

The random intercept or 'permanent component' $\zeta_j$ allows the level of the
response to vary across units. An advantage of this model compared with the
fixed effects model is that the between-unit model is no longer saturated and
we can include time-constant covariates. However, these advantages are
purchased at the cost of relying on several assumptions, such as zero
correlations between the random intercept and the covariates (see
Section 3.2.1).

The maximum likelihood estimator of $\boldsymbol{\beta}$ under normality of
$\zeta_j$ and $\epsilon_{ij}$ cannot be expressed in closed form. We will
instead discuss the generalized least squares (GLS) estimator
$\hat{\boldsymbol{\beta}}_{GLS}$, since it can be written in closed form and is
asymptotically equivalent to the maximum likelihood estimator. This estimator
also has the advantage that normality of $\zeta_j$ and $\epsilon_{ij}$ need not
be assumed.

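Although the within-estimator $\hat{\boldsymbol{\beta}}_W$ in (3.6.1) is an
unbiased and consistent estimator of $\boldsymbol{\beta}$ in the random
intercept model, the GLS estimator $\hat{\boldsymbol{\beta}}_{GLS}$ is a best
linear unbiased estimator (BLUE). The GLS estimator is a matrix weighted
average of the within-estimator $\hat{\boldsymbol{\beta}}_W$ and the
between-estimator $\hat{\boldsymbol{\beta}}_B$, where the weights are the
inverses of the covariance matrices of the respective estimators. The
between-estimator is the OLS estimator

$$\hat{\boldsymbol{\beta}}_B = \mathbf{B}_{xx}^{-1}\mathbf{B}_{xy}$$

for $\boldsymbol{\beta}$ in the between-unit model

$$\bar{y}_j - \bar{y} = (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})'\boldsymbol{\beta} + (\zeta_j - \bar{\zeta}) + (\bar{\epsilon}_j - \bar{\epsilon}), \tag{3.39}$$

where $\mathbf{B}_{xx} = \sum_j \sum_i (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})'$
and $\mathbf{B}_{xy} = \sum_j \sum_i (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{y}_j - \bar{y})$.
Note that the between-estimator only uses variation between units and ignores
the additional information from the longitudinal design as compared to a
cross-sectional design.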

The GLS estimator (e.g. Maddala, 1971) can then be expressed as

$$\hat{\boldsymbol{\beta}}_{GLS} = \mathbf{V}_W\hat{\boldsymbol{\beta}}_W + \mathbf{V}_B\hat{\boldsymbol{\beta}}_B,$$

where the weight matrices are

$$\mathbf{V}_W = (\mathbf{W}_{xx} + \omega\mathbf{B}_{xx})^{-1}\mathbf{W}_{xx}, \qquad \mathbf{V}_B = \mathbf{I} - \mathbf{V}_W,$$

and

$$\omega = \frac{\theta}{\theta + n\psi}.$$

The GLS estimator can alternatively be written as

$$\hat{\boldsymbol{\beta}}_{GLS} = (\mathbf{W}_{xx} + \omega\mathbf{B}_{xx})^{-1}(\mathbf{W}_{xy} + \omega\mathbf{B}_{xy}),$$

from which we see that $\omega$ essentially represents the weight given to the
between-unit variation. In fixed effects OLS, $\omega = 0$ and this source of
variation is ignored. OLS for a 'naive' model without unit-specific effects
corresponds to $\omega = 1$ so that all between-unit variation is added to the
within-unit variation. Treating the unit-specific effects as random thus
provides an intermediate approach between these extreme treatments of the
between-unit variation. Also note that $1 - \omega$ corresponds to the shrinkage
factor to be discussed in Section 7.3.1; see (7.5). In the special case of
balanced occasions, no missing data and balanced covariates
$\mathbf{x}_{ij} = \mathbf{x}_i$ we obtain $\mathbf{B}_{xx} = \mathbf{0}$. It
follows that the GLS estimator in this case is identical to the
within-estimator $\hat{\boldsymbol{\beta}}_W$.
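The role of the weight given to between-unit variation can be explored numerically for a single covariate. In the sketch below, the between-unit sums of squares are accumulated over all observations (so that within and between sums add up to the total), the variance components are treated as known, and the data, function names and values are all hypothetical.

```python
def sums_of_squares(x, y):
    """Within (W) and between (B) sums of squares and cross-products."""
    n_total = sum(len(xj) for xj in x)
    x_grand = sum(sum(xj) for xj in x) / n_total
    y_grand = sum(sum(yj) for yj in y) / n_total
    w_xx = w_xy = b_xx = b_xy = 0.0
    for xj, yj in zip(x, y):
        n = len(xj)
        x_mean, y_mean = sum(xj) / n, sum(yj) / n
        w_xx += sum((a - x_mean) ** 2 for a in xj)
        w_xy += sum((a - x_mean) * (b - y_mean) for a, b in zip(xj, yj))
        b_xx += n * (x_mean - x_grand) ** 2
        b_xy += n * (x_mean - x_grand) * (y_mean - y_grand)
    return w_xx, w_xy, b_xx, b_xy


def gls_slope(x, y, omega):
    """(W_xx + omega B_xx)^{-1} (W_xy + omega B_xy) for a scalar covariate."""
    w_xx, w_xy, b_xx, b_xy = sums_of_squares(x, y)
    return (w_xy + omega * b_xy) / (w_xx + omega * b_xx)


def omega_weight(theta, psi, n):
    """Weight theta / (theta + n psi) for n occasions per unit."""
    return theta / (theta + n * psi)


# Hypothetical balanced data: 3 units, 4 occasions.
x = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 7.0], [3.0, 3.0, 6.0, 8.0]]
y = [[2.1, 2.9, 4.2, 4.8], [5.0, 6.1, 6.4, 7.9], [7.2, 6.8, 8.9, 9.5]]

w_xx, w_xy, b_xx, b_xy = sums_of_squares(x, y)
beta_within = w_xy / w_xx
beta_between = b_xy / b_xx
beta_pooled = (w_xy + b_xy) / (w_xx + b_xx)   # naive OLS without unit effects

omega = omega_weight(theta=1.0, psi=2.0, n=4)  # variance components assumed known
beta_gls = gls_slope(x, y, omega)

# Equivalent weighted-average form with scalar weight V_W.
v_w = w_xx / (w_xx + omega * b_xx)
beta_weighted = v_w * beta_within + (1.0 - v_w) * beta_between

print(round(beta_within, 4), round(beta_gls, 4), round(beta_between, 4))
```

Setting the weight to zero reproduces the within-estimator, setting it to one reproduces pooled OLS, and intermediate values give an estimate between the within- and between-estimators.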

The conditional variances of the responses $y_{ij}$, or the variances of the
total residuals $\xi_{ij} = \zeta_j + \epsilon_{ij}$, are equal to
$\psi + \theta$ and constant across occasions. The conditional covariances for
any two occasions are just $\psi$ and the corresponding correlation is the
intraclass correlation previously encountered. This random intercept covariance
structure is shown in Display 3.3A on page 82. Note that this covariance
structure is the special case of a one-factor structure where $\lambda_i = 1$
and $\theta_{ii} = \theta$ for all $i$. The covariance structure is sometimes
referred to as exchangeable since the joint distribution of the residuals for a
given person remains unchanged if the residuals are exchanged across occasions.
The covariance structure is also consistent with the sphericity assumption
that the conditional variances
$\mathrm{Var}(y_{ij} - y_{i'j} \mid \mathbf{x}_{ij})$ of all pairwise differences
are equal. Note that the covariances $\psi$ are restricted to be nonnegative in
the random intercept model. If this restriction is relaxed, the above
configuration of the covariance structure is often called compound symmetric.
In the case of balanced occasions, we can allow the variance of $\epsilon_{ij}$
to take on a different value for each occasion, $\theta_{ii}$.
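A small sketch may help to see the exchangeable structure and the sphericity property together. The variance values below are hypothetical, and the function names are introduced here only for illustration.

```python
def random_intercept_cov(n, psi, theta):
    """Covariance matrix with psi off the diagonal and psi + theta on it."""
    return [[psi + (theta if i == k else 0.0) for k in range(n)]
            for i in range(n)]


def intraclass_correlation(psi, theta):
    return psi / (psi + theta)


def variance_of_difference(cov, i, k):
    """Var(y_i - y_k) = Var(y_i) + Var(y_k) - 2 Cov(y_i, y_k)."""
    return cov[i][i] + cov[k][k] - 2.0 * cov[i][k]


psi, theta = 2.0, 1.0                      # hypothetical variance components
cov = random_intercept_cov(4, psi, theta)

pairwise = [variance_of_difference(cov, i, k)
            for i in range(4) for k in range(i + 1, 4)]

print(intraclass_correlation(psi, theta))  # common correlation between occasions
print(pairwise)                            # all equal to 2 * theta (sphericity)
```

Every pairwise difference has variance equal to twice the residual variance, because the random intercept cancels when two responses from the same unit are subtracted.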

Random coefficient models - Growth curve models 

The random coefficient model (e.g. Laird and Ware, 1982) allows both the
level of the response and the effects of covariates to vary randomly across
units. The model was previously specified as a two-level model in (3.12),

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\boldsymbol{\zeta}_j + \epsilon_{ij},$$

where $i = 1, 2, \ldots, n_j$. Here, $\mathbf{x}_{ij}$ denotes both time-varying
and time-constant covariates with fixed coefficients $\boldsymbol{\beta}$ and
$\mathbf{z}_{ij}$ time-varying covariates with random coefficients
$\boldsymbol{\zeta}_j$. Since the random coefficients have zero means,
$\mathbf{x}_{ij}$ will typically contain all elements in $\mathbf{z}_{ij}$, with
the corresponding fixed effects interpretable as the mean effects. The first
element of the vectors is typically equal to one corresponding to a fixed and
random intercept. Letting $\boldsymbol{\Psi} \equiv \mathrm{Cov}(\boldsymbol{\zeta}_j)$,
the covariance structure of the vector $\mathbf{y}_j$ is presented in
Display 3.3B on page 82. The special case where the residual variances are set
equal across occasions, $\theta_{ii} = \theta$, is common.

A useful version of the random coefficient model for longitudinal data is a
growth curve model where individuals are assumed to differ not only in their
intercepts but also in other aspects of their trajectory over time, for example
in the linear growth (or decline) of the response over time. These models
include random coefficients for (functions of) time. For example, a linear
growth curve model can be written as

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_{0j} + \zeta_{1j}t_{ij} + \epsilon_{ij}, \tag{3.40}$$

where $t_{ij}$, the time at the $i$th occasion for individual $j$, is one of the
covariates in $\mathbf{x}_{ij}$. The random intercept and slope should not be
specified as uncorrelated, because translation of the time scale $t_{ij}$
changes the magnitude of the correlation as illustrated in Figure 3.1 on
page 54 (see also Elston (1964) and Longford (1993)).
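The effect of translating the time scale can be worked out directly: if time is recoded by subtracting a constant, the new random intercept equals the old intercept plus that constant times the random slope. The following sketch applies this to hypothetical variance components; the function names and values are assumptions for illustration.

```python
import math


def shift_time_origin(psi00, psi01, psi11, c):
    """Intercept/slope (co)variances after recoding time as t* = t - c.

    The new intercept is zeta0 + c * zeta1; the slope is unchanged.
    """
    new00 = psi00 + 2.0 * c * psi01 + c * c * psi11
    new01 = psi01 + c * psi11
    return new00, new01, psi11


def correlation(psi00, psi01, psi11):
    return psi01 / math.sqrt(psi00 * psi11)


def marginal_variance(t, psi00, psi01, psi11, theta):
    """Var(y) at time t for a linear growth curve model."""
    return psi00 + 2.0 * t * psi01 + t * t * psi11 + theta


# Hypothetical values: intercept and slope uncorrelated on the original time scale.
psi00, psi01, psi11, theta = 4.0, 0.0, 1.0, 1.0
shifted = shift_time_origin(psi00, psi01, psi11, c=2.0)

print(correlation(psi00, psi01, psi11))   # 0.0 on the original scale
print(round(correlation(*shifted), 4))    # nonzero after moving the origin

# The implied variance of the response is unaffected by the recoding:
print(marginal_variance(3.0, psi00, psi01, psi11, theta),
      marginal_variance(1.0, *shifted, theta))
```

A zero intercept-slope correlation on one time scale therefore becomes nonzero on another, while the implied variances of the responses stay the same, so constraining the correlation to zero is not an innocuous restriction.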

The covariance structure is the same as for the two-level random coefficient
model in (3.15), shown explicitly for the variance in (3.11). A path diagram
of this model is shown in the first panel of Figure 3.8 on page 86, where there
are three occasions with times $t_1 = 0$, $t_2 = 1$ and $t_3 = 2$. The second
diagram represents the unbalanced case, where all variables inside the box
labelled 'unit $j$' have a $j$ subscript and vary between units. Variables that
are also in the box labelled 'occasion $i$' vary between occasions and units and
have both an $i$ and $j$ subscript. The arrow from $t$ to $y$ therefore
represents a regression of $y_{ij}$ on $t_{ij}$. The latent variable
$\zeta_{1j}$ modifies this regression or interacts with $t_{ij}$ and therefore
represents the random slope.

In the case of balanced occasions, the linear growth curve model can also
be formulated as a two-factor model,

$$y_{ij} = \lambda_{0i}\eta_{0j} + \lambda_{1i}\eta_{1j} + \epsilon_{ij}.$$

Here

$$\eta_{0j} = \beta_0 + \zeta_{0j}, \qquad \eta_{1j} = \beta_1 + \zeta_{1j},$$

the loadings for the intercept factor $\eta_{0j}$ are fixed to
$\lambda_{0i} = 1$ and the loadings for the slope factor $\eta_{1j}$ are set
equal to $t_i$. Note that the means of the factors cannot be set to zero here as
is usually done in factor models.

Meredith and Tisak (1990) suggest using a two-factor model similar to that
in the first diagram of Figure 3.8 but with free factor loadings for
$\eta_{1j}$ (subject to identification restrictions, such as $\lambda_{11} = 0$
and $\lambda_{12} = 1$) to model nonlinear growth. Traditionally, estimation of
this factor model would require balanced occasions without missing data, but
this is no longer a limitation. However, if the occasions are very unbalanced,
with few responses at a given occasion, factor models can no longer be used
since reliable estimation of occasion-specific factor loadings would be
precluded.


Figure 3.8 Path diagrams for growth curve models with balanced and unbalanced
occasions

The models can easily be extended to noncontinuous responses by using 
generalized linear mixed models. We will estimate a random coefficient model 
for longitudinal count data in Section 11.3. 

Longitudinal models with discrete latent variables - latent trajectory models 

It is sometimes believed that the population consists of different types or
classes of units characterized by different patterns of development or
development trajectories over time. The models are latent class models having
the same form as the latent growth models discussed above except that the
random effects are now discrete.
For instance, in a linear latent trajectory model analogous to (3.40), the
linear predictor for a unit in class $c$ is given by

$$\nu_{ijc} = e_{0c} + e_{1c}t_{ij}.$$

Each latent class is therefore characterized by a pair of coefficients $e_{0c}$
and $e_{1c}$, representing the intercept and slope of the latent trajectory. For
balanced occasions, we do not have to assume that the latent trajectories are
linear or have another particular shape but can instead specify an unstructured
model with latent trajectory

$$\nu_{ijc} = e_{ic}, \quad i = 1, \ldots, I,$$

for class $c$, $c = 1, \ldots, C$. In the case of categorical responses, latent
trajectory models are typically referred to as latent class growth models (e.g.
Nagin and Land, 1993). They are an application of mixture regression models
(e.g. Quandt, 1972) to longitudinal data.

If the responses are continuous, the models are known as latent profile
models (e.g. Gibson, 1959),

$$y_{ijc} = e_{ic} + \epsilon_{ijc}.$$

Here the variance of the residuals $\epsilon_{ijc}$ could be allowed to differ
between classes. Both latent class and latent profile models assume that the
responses on a unit are conditionally independent given latent class
membership. Muthén and Shedden (1999) relax this assumption for continuous
responses in their growth mixture models by allowing the residuals
$\epsilon_{ijc}$ to be correlated conditional on latent class membership with
covariance matrices differing between classes.


3.6.2 Models with correlated residuals 

Random intercept models include two random terms, the random intercept
and the occasion-specific residual. While the random intercept represents
effects of random influences or omitted covariates that remain constant over
time, the residuals represent effects of random influences that are immediate
and do not persist over more than a single occasion. The resulting compound
symmetric correlation structure often does not reflect what is observed in
practice, namely that (conditional) correlations between pairs of responses
tend to be greater if the responses occurred closer together in time.

Such correlation structures can be induced by allowing the effects of omitted
variables to be distributed over time, leading to autocorrelated errors. It
should be noted that this omitted variable interpretation requires that the
total effect of the influences represented by $\epsilon_{ij}$ averages out to
zero over units and also that it is uncorrelated with $\mathbf{x}_{ij}$ (e.g.
Maddala, 1977). In the following subsections, we discuss the case of continuous
responses, sometimes indicating how the models are modified for other response
types.
Autoregressive residuals

When occasions are equally spaced in time, a first order autoregressive model
AR(1) can be expressed as

$$\epsilon_{ij} = \alpha\,\epsilon_{i-1,j} + \delta_{ij}, \tag{3.41}$$

where $\epsilon_{i-1,j}$ is independently distributed from the 'innovation
errors' $\delta_{ij}$, $\delta_{ij} \sim N(0, \sigma_{\delta}^{2})$. This is
illustrated in path diagram form in the first panel of Figure 3.9. Note that a
'random walk' is obtained if $\alpha = 1$ in the AR(1) model.


Figure 3.9 Path diagrams for autoregressive responses and autoregressive and moving
average residuals (panels: AR(1) residuals, MA(1) residuals, AR(1) responses)

Assuming that the process is weakly stationary, $|\alpha| < 1$, the covariance
structure is as shown in Display 3.3C on page 82. It follows that the
correlations between residuals at different occasions are structured as

$$\mathrm{Cor}(\epsilon_{ij}, \epsilon_{i+k,j}) = \alpha^{k}.$$

For non-equally spaced occasions, the correlation structure is often specified
as

$$\mathrm{Cor}(\epsilon_{ij}, \epsilon_{i+k,j}) = \alpha^{|t_{i+k} - t_i|},$$

where the correlation structure for unbalanced occasions is simply obtained
by replacing $t_i$ by $t_{ij}$ (e.g. Diggle, 1988).
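The correlation pattern implied by first order autoregressive residuals is easy to tabulate. The sketch below assumes the commonly used form in which the correlation between two occasions is the autoregressive parameter raised to the absolute time difference, and it restricts the parameter to lie between zero and one so that fractional powers are well defined. The function name and the time points are hypothetical.

```python
def ar1_correlation(times, alpha):
    """Correlation matrix with entries alpha ** |t_i - t_k|.

    Assumes 0 < alpha < 1, so that non-integer time gaps pose no problem.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this sketch assumes 0 < alpha < 1")
    return [[alpha ** abs(t1 - t2) for t2 in times] for t1 in times]


alpha = 0.6

# Equally spaced occasions: correlations fall off as alpha ** k with lag k.
equal = ar1_correlation([0.0, 1.0, 2.0, 3.0], alpha)
print([round(v, 3) for v in equal[0]])

# Unequally spaced occasions: the gap in time, not the lag index, matters.
unequal = ar1_correlation([0.0, 0.5, 2.0], alpha)
print([round(v, 3) for v in unequal[0]])
```

The first row of the equally spaced matrix shows the geometric decay with increasing lag that is characteristic of this structure.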

These first order autoregressive covariance structures are often as
unrealistic as compound symmetry since the correlations fall off too rapidly
with increasing time-lags. One possibility is to specify a higher order
autoregressive process of order $k$, AR($k$),

$$\epsilon_{ij} = \alpha_1\epsilon_{i-1,j} + \alpha_2\epsilon_{i-2,j} + \cdots + \alpha_k\epsilon_{i-k,j} + \delta_{ij}.$$

Another is to add a random intercept to the AR(1) model (see 'Hybrid
specifications' on page 91). In the case of balanced occasions, we can also
specify a different parameter $\alpha_i$ for each occasion, giving an
antedependence structure (e.g. Gabriel, 1962) for the residuals.
Moving average residuals

Random shocks disturb the response variable for some fixed number of periods
before disappearing. Such a process can be modeled by means of moving
averages (see Box et al., 1994). A first order moving average process MA(1)
for the residuals can be specified as

$$\epsilon_{ij} = \delta_{ij} + a\,\delta_{i-1,j}.$$

A path diagram for this model is given in the second panel of Figure 3.9 and
the covariance structure is presented in Display 3.3D on page 82. We see that
the process 'forgets' what happened more than one period in the past, in
contrast to the autoregressive processes.

The moving average model of order $k$, MA($k$), is given as

$$\epsilon_{ij} = \delta_{ij} + a_1\delta_{i-1,j} + a_2\delta_{i-2,j} + \cdots + a_k\delta_{i-k,j},$$

with 'memory' extending $k$ periods in the past.
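The limited memory of moving average residuals shows up in their autocovariances, which vanish beyond the order of the process. The sketch below computes them from the standard result for sums of independent innovations with common variance; the function name and coefficient values are hypothetical.

```python
def ma_autocovariances(coefs, sigma2, max_lag):
    """Autocovariances of eps_i = delta_i + sum_k coefs[k-1] * delta_{i-k}.

    With weights c_0 = 1, c_1, ..., c_q the covariance at lag h is
    sigma2 * sum_k c_k * c_{k+h}, which is zero for h > q.
    """
    weights = [1.0] + list(coefs)
    result = []
    for lag in range(max_lag + 1):
        terms = [weights[k] * weights[k + lag]
                 for k in range(len(weights) - lag)]
        result.append(sigma2 * sum(terms))
    return result


# MA(1) with hypothetical coefficient 0.5 and unit innovation variance.
ma1 = ma_autocovariances([0.5], sigma2=1.0, max_lag=3)
print(ma1)   # variance, lag-1 covariance, then zeros

# MA(2): the 'memory' extends two periods.
ma2 = ma_autocovariances([0.5, 0.25], sigma2=1.0, max_lag=3)
print(ma2)
```

For the first order process only the variance and the lag-one covariance are nonzero, in contrast to the geometric decay of autoregressive residuals.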

3.6.3 Models with lagged responses 

In these models lags of the response $y_{ij}$ are included as covariates in
addition to $\mathbf{x}_{ij}$. The dependence on previous responses is called
'state dependence'; see Section 9.6 for elaboration and an application. The
models are also referred to as transition models (e.g. Diggle et al., 2002,
Chapter 10). When occasions are equally spaced in time, a first order
autoregressive model for the responses $y_{ij}$ can be written as

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \gamma\,y_{i-1,j} + \epsilon_{ij}.$$

Assuming that the process is weakly stationary, $|\gamma| < 1$, the covariance
structure is shown in Display 3.3E on page 82. A path diagram for this model is
shown in the third panel of Figure 3.9.

As for the residual autoregressive structure, the first order autoregressive
structure for responses is often deemed unrealistic, since the correlations fall
off too rapidly with increasing time-lags. Once again, this may be rectified by
specifying a higher order autoregressive process AR($k$),

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \gamma_1 y_{i-1,j} + \gamma_2 y_{i-2,j} + \cdots + \gamma_k y_{i-k,j} + \epsilon_{ij}.$$

An extension of the autoregressive model is the antedependence model for
responses which specifies a different parameter $\gamma_i$ for each occasion.

Apart from being of interest in its own right, the lagged response model is
useful in distinguishing between different longitudinal models. Consider two
simple models: a state dependence model with a lagged response and lagged
covariate but independent residuals,

$$y_{ij} = \gamma\,y_{i-1,j} + \beta_1 x_{ij} + \beta_2 x_{i-1,j} + \epsilon_{ij}, \tag{3.42}$$

and an autocorrelation model without lagged response or lagged covariate,

$$y_{ij} = \beta x_{ij} + \epsilon_{ij},$$

but residuals having an AR(1) structure. Substituting first for
$\epsilon_{ij} = \alpha\,\epsilon_{i-1,j} + \delta_{ij}$ from (3.41), then for
$\epsilon_{i-1,j} = y_{i-1,j} - \beta x_{i-1,j}$ and reexpressing, the
autocorrelation model can alternatively be written as

$$y_{ij} = \alpha\,y_{i-1,j} + \beta x_{ij} - \alpha\beta x_{i-1,j} + \delta_{ij}.$$

Note that this model is equivalent to the state dependence model (3.42) with
the restriction $\gamma\beta_1 + \beta_2 = 0$. Importantly, this means that we
can use the state dependence model to discriminate between autocorrelated
residuals and state dependence in longitudinal models. The distinction between
true and 'spurious' state dependence (apparent state dependence that disappears
when appropriately modeling residual dependence) is crucial in many
applications, see Section 9.6 for an example. It also follows that the ritual of
performing a Durbin-Watson test (Durbin and Watson, 1950) for autocorrelation in
longitudinal modeling should be preceded by ruling out the state dependence
model. Otherwise, a significant Durbin-Watson statistic is ambiguous, indicating
state dependence and/or autocorrelation.
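The equivalence between the autocorrelation model and a restricted state dependence model can be verified numerically for a single unit. The sketch below generates responses from a regression with first order autoregressive residuals and then checks that the same responses satisfy the lagged response form with the lagged-covariate coefficient equal to minus the product of the autoregressive and regression parameters. The starting value of the residual process and all numbers are assumptions chosen for illustration.

```python
alpha, beta = 0.6, 1.5                        # hypothetical parameters
x = [0.2, 1.0, -0.5, 0.7, 1.3, -1.1]          # covariate over six occasions
delta = [0.3, -0.2, 0.1, 0.4, -0.3, 0.05]     # innovation errors

# Autocorrelation model: y_i = beta * x_i + eps_i,
# with eps_i = alpha * eps_{i-1} + delta_i.
eps = [delta[0]]  # assumption: the process starts at the first innovation
for i in range(1, len(x)):
    eps.append(alpha * eps[i - 1] + delta[i])
y = [beta * x_i + e for x_i, e in zip(x, eps)]

# Restricted state dependence form:
# y_i = gamma * y_{i-1} + beta1 * x_i + beta2 * x_{i-1} + delta_i.
gamma, beta1, beta2 = alpha, beta, -alpha * beta
implied_innovations = [
    y[i] - (gamma * y[i - 1] + beta1 * x[i] + beta2 * x[i - 1])
    for i in range(1, len(x))
]

print([round(v, 10) for v in implied_innovations])  # matches delta[1:]
print(gamma * beta1 + beta2)                        # the restriction holds
```

The residuals recovered from the lagged response form coincide with the innovation errors, which is what makes the restriction testable within the state dependence model.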

Lagged response models should be used with caution. First, lags should be
avoided if the lagged effects do not have a ‘causal’ interpretation since the
interpretation of $\boldsymbol{\beta}$ changes when $y_{i-1,j}$ is included as
an additional covariate. Second, the models require balanced data in the sense
that all units are measured on the same occasions. If the response for a unit
is missing at an occasion, the entire unit must be discarded. Third, lagged
response models reduce the sample size. This is because the $y_{ij}$ on the
first occasions can only serve as covariates and cannot be regressed on lagged
responses (which are missing). Fourth, an initial condition problem arises for
the common situation where the process is ongoing when we start observing it
(e.g. Heckman, 1981b).

An advantage of lagged response models as compared to models with
autoregressive residuals is that they can easily be used for response types
other than the continuous.


3.6.4 Other covariance structures 
Unrestricted 

Instead of attempting to model the covariance structure, we can simply specify

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \epsilon_{ij},$$

where the vector of residuals $\boldsymbol{\epsilon}_j$ is multivariate normal
with an unrestricted covariance matrix. This unrestricted model requires
balanced data with $n_j = I$ and this specification corresponds to a repeated
measures MANOVA (e.g. Hand and Crowder, 1996, Chapter 2).

This approach provides a safeguard against false specifications of the
dependence among the responses within a unit, the only assumption being that
the responses of all units are multinormal with the same residual covariance
matrix. However, the specification requires large sample sizes when there are
many occasions $I$ since $I(I+1)/2$ covariance parameters need to be estimated
along with the regression coefficients. Estimation of the unrestricted model
is also inefficient if structured versions are valid.

Factor models 

We can induce dependence between responses by including factor structures 
in the linear predictor. This approach can also be useful for generalized linear 
mixed models where we often cannot freely specify conditional correlations 
(e.g. Rabe-Hesketh and Skrondal, 2001). 

For a continuous response, a one-factor model for the residual is specified
as

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \lambda_i\eta_j + \epsilon_{ij},$$

where $\mathbf{x}_{ij}$ denotes covariates with fixed coefficients
$\boldsymbol{\beta}$, $\lambda_i$ a factor loading for occasion $i$ and
$\eta_j$ a factor. $\xi_{ij} \equiv \lambda_i\eta_j + \epsilon_{ij}$ can be
viewed as the total residual. Note that use of the model requires a certain
degree of balance, since a factor loading is estimated for each occasion. The
covariance structure of $\mathbf{y}_j$, called a factor structure, is given in
Display 3.3F.i on page 82. Note that the random intercept model arises as the
special case where $\lambda_i = 1$ and the restricted random intercept model
(producing compound symmetry) results when the additional restrictions
$\theta_{ii} = \theta$ are imposed. For three occasions, $I = 3$, the
one-factor model is equivalent to the unrestricted model. Special cases for
$I = 3$ include a stationary first-order autoregressive residual process and a
first-order moving average residual process with a random intercept (e.g.
Heckman, 1981c).
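
The parameter counts behind these statements can be made concrete. The sketch
below is illustrative only: the function names are hypothetical, and fixing the
factor variance at 1 for identification is an assumed convention rather than
one stated in the text. It counts covariance parameters for the unrestricted
and one-factor structures and builds the implied one-factor covariance matrix
$\psi\boldsymbol{\lambda}\boldsymbol{\lambda}' + \mathrm{diag}(\theta_{ii})$.

```python
def n_unrestricted(n_occasions):
    """Free parameters in an unrestricted I x I covariance matrix: I(I+1)/2."""
    return n_occasions * (n_occasions + 1) // 2


def n_one_factor(n_occasions):
    """Free parameters of a one-factor structure: I loadings and I unique
    variances, with the factor variance fixed at 1 (assumed convention)."""
    return 2 * n_occasions


def one_factor_covariance(loadings, unique_variances, factor_variance=1.0):
    """Implied covariance: factor_variance * lambda lambda' + diag(theta)."""
    size = len(loadings)
    return [
        [
            factor_variance * loadings[r] * loadings[c]
            + (unique_variances[r] if r == c else 0.0)
            for c in range(size)
        ]
        for r in range(size)
    ]


for occasions in (3, 5, 10):
    print(occasions, n_unrestricted(occasions), n_one_factor(occasions))

# Random intercept special case: all loadings equal to 1; equal unique
# variances then give compound symmetry.
compound_symmetry = one_factor_covariance(
    [1.0, 1.0, 1.0], [0.5, 0.5, 0.5], factor_variance=2.0
)
```

For $I = 3$ both counts equal 6, which is consistent with the statement that
the one-factor and unrestricted models coincide for three occasions; for larger
$I$ the factor structure is far more parsimonious.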

A multidimensional factor model for the residuals can be specified as

$$y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \boldsymbol{\lambda}_i'\boldsymbol{\eta}_j + \epsilon_{ij},$$

where $\mathbf{x}_{ij}$ denote covariates with fixed coefficients
$\boldsymbol{\beta}$, $\boldsymbol{\lambda}_i$ is a vector of factor loadings
for occasion $i$ and $\boldsymbol{\eta}_j$ a vector of factors. The
multidimensional factor structure is shown in Display 3.3F.ii on page 82. In
the longitudinal setting, in contrast to measurement modeling, we believe that
the choice between exploratory or confirmatory factor models should be made on
a pragmatic basis, since no meaning is attributed to the factor.

For nonnormal responses, the multidimensional factor model is specified for
the linear predictor as

$$\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \boldsymbol{\lambda}_i'\boldsymbol{\eta}_j.$$

Hybrid specifications 

The different dependence specifications we have surveyed can be combined.
A famous example is the ARMA model which combines autoregressive and
moving average models. Another possibility is to combine the random intercept
model with a first order autoregressive process for the responses (e.g.
Jöreskog, 1978), or with a first order autoregressive process for the residuals
(e.g. Diggle, 1988), thereby relaxing the conditional independence assumption
usually made in multilevel models. Other approaches include ARIMA models,
where differencing is used in order to obtain stationarity (e.g. Box et al.,
1994) and their special cases.


3.6.5 Generalized estimating equations for nonnormal responses 

Most models discussed so far are based on the notion that the dependence 
among responses (conditional on the covariates) can be modeled and in some 
sense explained by latent variables. For instance, in growth curve modeling,
the random effects capture individual differences in growth trajectories and 
simultaneously induce residual dependence. 

A radically different approach is to focus on the mean structure and relegate
the dependence to a nuisance, by using generalized estimating equations
(GEE) (e.g. Liang and Zeger, 1986; Zeger and Liang, 1986); see Section 6.9.
The simplest version is to estimate the mean structure as if the responses
were independent and then adjust standard errors for the dependence using
the so-called sandwich estimator (see Section 8.3.3). The parameter estimates
can be shown to be consistent, but if the responses are correlated, they are
not efficient. To increase efficiency a ‘working correlation matrix’ is
therefore specified within a multivariate extension of the iteratively
reweighted least squares algorithm for generalized linear models (see Section
6.9 for details). Typically, one of the structures listed in Display 3.3 is
used for the working correlation matrix of the residuals
$y_{ij} - g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta})$, as well as unrestricted
and independence correlation structures. The working correlation matrix is
combined with the variance function of an appropriate generalized linear model,
typically allowing for overdispersion if the responses are counts. It is
important to note that, apart from continuous responses, the specified
correlation structures generally cannot be derived from a statistical model.
Thus, there is no likelihood and GEE is a multivariate quasi-likelihood
approach.
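
The simplest version described above can be sketched for the special case of an
identity link, a single covariate and no intercept. This is an illustration
under those assumptions, not the general GEE algorithm of Section 6.9; the
function name and the toy data are hypothetical. The point estimate ignores the
clustering, and only the variance estimate changes when the dependence is taken
into account.

```python
def independence_gee(clusters):
    """Slope of a no-intercept linear model estimated as if all responses were
    independent, with naive and cluster-robust (sandwich) variance estimates.

    `clusters` is a list of clusters, each a list of (x, y) pairs.
    """
    pairs = [pair for cluster in clusters for pair in cluster]
    sum_xx = sum(x * x for x, _ in pairs)
    beta = sum(x * y for x, y in pairs) / sum_xx

    # Naive variance: assumes independent, homoscedastic residuals.
    rss = sum((y - beta * x) ** 2 for x, y in pairs)
    naive_var = rss / (len(pairs) - 1) / sum_xx

    # Sandwich variance: 'bread' 1/sum_xx on both sides of a 'meat' made of
    # squared sums of the score contributions within each cluster.
    meat = sum(
        sum(x * (y - beta * x) for x, y in cluster) ** 2 for cluster in clusters
    )
    sandwich_var = meat / sum_xx ** 2
    return beta, naive_var, sandwich_var


# Toy data: two clusters with two responses each.
clusters = [[(1.0, 1.0), (1.0, 3.0)], [(2.0, 4.0), (2.0, 2.0)]]
beta, naive_var, sandwich_var = independence_gee(clusters)
```

With so few clusters the sandwich estimate is of course unreliable; the example
only shows where the within-cluster dependence enters the calculation.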

In general the regression coefficients estimated using GEE have a different
interpretation than those of models including latent variables. The latter
represent the conditional effects of covariates given the latent variables,
that is, unit-specific effects in longitudinal settings. GEE, on the other
hand, provides marginal or population averaged effects, where the individual
differences are averaged over instead of modeled by latent variables. In probit
and logistic regression the marginal effects tend to be attenuated (closer to
zero) compared with the conditional effects, as was shown for the probit case
in Figure 1.6 on page 12. Differences between marginal and conditional effects
also arise for other links and models with random coefficients, exceptions
being models with an identity link and models with a log link and a normally
distributed random intercept (see also Section 4.8.1).

Note that there are also ‘proper’ marginal statistical models with
corresponding likelihoods. Examples include the Bahadur model (Bahadur, 1961)
which parameterizes dependence via marginal correlations and the Dale model
(Dale, 1986) which parameterizes dependence via marginal bivariate odds-ratios;
see Fitzmaurice et al. (1993) and Molenberghs (2002) for introductions.

Whether conditional or marginal effects are of interest will depend on the
context. For example, in public health, population averaged effects may be of
interest, whereas conditional effects are important for the patient and
clinician. Importantly, marginal effects can be derived from conditional models
by integrating out the latent variables. Unfortunately, conditional effects
cannot generally be derived from marginal effects. Conditional effects are
likely to be stable across populations. However, if the conditional effect is
the same in two populations, but the random intercept variance differs, the
marginal effects will be different.
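
Integrating out a latent variable can be illustrated for a random intercept
probit model. The sketch below is an assumption-laden illustration (function
names, parameter values and the trapezoidal integration scheme are all
hypothetical choices): it integrates the conditional probability
$\Phi(\beta x + \zeta)$ over $\zeta \sim N(0, \psi)$ numerically and compares
the result with the standard closed form $\Phi(\beta x / \sqrt{1 + \psi})$ for
this model.

```python
import math


def normal_cdf(value):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(value / math.sqrt(2.0)))


def normal_pdf(value, sd):
    """Density of a normal distribution with mean zero and standard deviation sd."""
    return math.exp(-0.5 * (value / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def marginal_probability(linear_predictor, intercept_variance, n_steps=4000):
    """Integrate the conditional probit probability over a normal random
    intercept with the trapezoidal rule on +/- 10 standard deviations."""
    sd = math.sqrt(intercept_variance)
    lower, upper = -10.0 * sd, 10.0 * sd
    step = (upper - lower) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        zeta = lower + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * normal_cdf(linear_predictor + zeta) * normal_pdf(zeta, sd)
    return total * step


beta, x = 1.0, 0.7
for psi in (0.5, 2.0):
    numeric = marginal_probability(beta * x, psi)
    closed_form = normal_cdf(beta * x / math.sqrt(1.0 + psi))
    print(psi, round(numeric, 6), round(closed_form, 6))
```

The same conditional effect $\beta$ thus yields different marginal effects
$\beta/\sqrt{1+\psi}$ in populations with different random intercept variances
$\psi$, which is the point made in the paragraph above.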

Note that Heagerty and Zeger (2000) introduce latent variable models where 
the marginal mean is regressed on covariates as in GEE. In these models the 
relationship between the conditional mean (given the latent variables) and the 
covariates is found by solving an integral equation linking the conditional and 
marginal means (see equation (4.28) on page 123). Interestingly, the integral 
involved can be written as a unidimensional integral over the distribution of 
the sum of the terms in the random part of the model. 


3.7 Summary and further reading 

We have described classical latent variable models such as multilevel regression 
models, measurement models, exploratory and confirmatory factor models, 
item response models, structural equation models, latent class models and 
several models for longitudinal data. A unifying framework for these classical 
latent variable models, combining them with the response processes described 
in Chapter 2, is presented in the next chapter. Some classical latent variable 
models are employed in the Application part of this book, particularly in 
Chapter 9, although most applications are based on extended models. 

There are a large number of books on multilevel models, see for example 
(in approximate order of difficulty) Kreft and de Leeuw (1998), Hox (2002), 
Raudenbush and Bryk (2002), Snijders and Bosker (1999), Aitkin et al. (2004), 
Goldstein (2003), Longford (1993), Cox and Solomon (2002) and McCulloch 
and Searle (2001). 

We have only presented a simplified version of measurement theory, not going
into for instance generalizability theory (see e.g. Cronbach et al., 1972;
Shavelson and Webb, 1991; Brennan, 2001). For introductory reading we
recommend Streiner and Norman (1995). Intermediate treatments include
Crocker and Algina (1986) and Dunn (2004). An advanced and authoritative
treatment is provided by Lord and Novick (1968). Lawley and Maxwell
(1971) and Mulaik (1972) are useful books on factor models for continuous 
responses whereas Bartholomew and Knott (1999) also consider dichotomous, 
polytomous and mixed responses. 

Books on item response theory include Lord and Novick (1968), Lord (1980),
Hambleton and Swaminathan (1985), Hambleton et al. (1991), van der Linden
and Hambleton (1997), Embretson and Reise (2000) and De Boeck and
Wilson (2004). We have not discussed nonparametric item response theory
(e.g. Sijtsma and Molenaar, 2002) or unfolding (ideal point) models where the
item characteristic curves are nonmonotonic (e.g. Coombs, 1964; Roberts et
al., 2000).

An introduction to latent class modeling is given by McCutcheon (1987) and 
a survey is given by Clogg (1995). Books on mixture models include Everitt 
and Hand (1981), McLachlan and Peel (2000) and Böhning (2000).

Books on structural equation models include Dunn et al. (1993), Bollen 
(1989) and, in econometrics, Wansbeek and Meijer (2002). 

We have not discussed state-space models for longitudinal data (e.g. Jones,
1993) or hidden Markov (latent transition) models (e.g. MacDonald and
Zucchini, 1997; van de Pol and Langeheine, 1990). Useful books on modelling
longitudinal data include Hand and Crowder (1996), Crowder and Hand (1990),
Hsiao (2002), Baltagi (2001), Diggle et al. (2002), Lindsey (1999) and Everitt
and Pickles (1999).


CHAPTER 4


General model framework 


4.1 Introduction 

The general model framework unifies and generalizes the multilevel, factor, 
item response, latent class, structural equation and longitudinal models
discussed in Chapter 3.

In that chapter we were mainly concerned with models having continuous 
responses. Here we describe latent variable models accommodating all the 
response processes discussed in Chapter 2. As we shall see, and in contrast to 
the models in Chapter 3, random coefficients and factors can now be included 
in the same model. Latent variables are also allowed to vary at several levels, 
yielding for instance multilevel factor models. Multilevel structural equations 
can be specified to regress latent variables on same and higher level latent and 
observed variables. We will also relax the assumption of multivariate normality 
of the latent variables by using other continuous or discrete distributions or 
nonparametric maximum likelihood. Different kinds of latent class models are 
also accommodated. The model framework mostly corresponds to the class of 
Generalized Linear Latent And Mixed Models (GLLAMM) described in
Rabe-Hesketh et al. (2004a); see also Rabe-Hesketh et al. (2001a). However, we also
discuss model types not accommodated within that class such as multilevel 
latent class models. 

The essence of the general model formulation is the specification of
hierarchical conditional relationships: the response model specifies the
distribution of the observed responses conditional on the latent variables and
covariates (via a linear predictor and link function) and in the structural
model the latent variables themselves may be regressed on other latent and
observed covariates. Finally, the distribution of the disturbances in the structural model
is specified. Sections 4.2 to 4.4 of this chapter are therefore on: 

• the response model 

• the structural model for the latent variables 

• the distribution of the disturbances in the structural model 

An essential part of model specification concerns imposing appropriate
restrictions on model parameters. Hence, different types of parameter
restrictions are presented and the related notion of fundamental parameters
introduced in Section 4.5.

In order to fully understand a latent variable model, it is important to
consider the moment structure of the observed responses, marginal with respect
to latent variables but conditional on observed covariates. To derive this, we
start by deriving the reduced forms of the latent variables and the linear 
predictor in Section 4.6. We then derive the moment structure of the latent 
variables in Section 4.7 by integrating out the disturbances of the structural 
model. (This section and the previous are somewhat technical and might be 
skipped if desired.) Having obtained this moment structure, we derive the 
moment structure of the responses in Section 4.8 by integrating out the latent 
variables. This helps clarify the crucial distinction between conditional and 
marginal covariate effects in models with latent variables. Finally, we derive 
the reduced form distribution in Section 4.9, the conditional distribution of 
the observed responses given the explanatory variables, which represents the 
basis for the likelihood. The concept of reduced form parameters, which is 
important for the discussion of identification and equivalence in Chapter 5, is 
introduced in Section 4.10. 


4.2 Response model 

Conditional on the latent variables, the response model is a generalized linear
model specified via a linear predictor $\nu_i$, a link function $g(\cdot)$ and
a distribution from the exponential family

$$f(y_i \mid \theta_i, \phi) = \exp\left\{\frac{y_i\theta_i - b(\theta_i)}{\phi} + c(y_i, \phi)\right\},$$

where $\theta_i$ is a function of the mean $\mu_i = g^{-1}(\nu_i)$ and depends
on latent variables. Any of the conditional densities for a generalized linear
model can be specified for the responses, including the extensions introduced
in Chapter 2. Models for scale parameters and thresholds may also be specified.
Table 4.1 lists the response types that can be handled and the application
chapters discussing each type.
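
As a concrete instance of the exponential family form, the sketch below writes
the Poisson and Bernoulli distributions in terms of $\theta$, $b(\theta)$,
$\phi$ and $c(y, \phi)$. The function names are hypothetical and the choice of
log and logit links is an assumption made for illustration; the linear
predictor is simply passed in as a number, so any latent variables would
already be included in it.

```python
import math


def exponential_family_density(y, theta, phi, b, c):
    """exp{ [y*theta - b(theta)] / phi + c(y, phi) }."""
    return math.exp((y * theta - b(theta)) / phi + c(y, phi))


def poisson_via_exponential_family(y, linear_predictor):
    """Poisson response, log link: mu = exp(nu), theta = log(mu), b = exp, phi = 1."""
    mu = math.exp(linear_predictor)  # inverse link g^{-1}(nu)
    theta = math.log(mu)             # canonical parameter as a function of the mean
    return exponential_family_density(
        y, theta, 1.0, b=math.exp, c=lambda count, _phi: -math.lgamma(count + 1.0)
    )


def bernoulli_via_exponential_family(y, linear_predictor):
    """Dichotomous response, logit link: theta = logit(mu), b = log(1 + exp(theta))."""
    mu = 1.0 / (1.0 + math.exp(-linear_predictor))
    theta = math.log(mu / (1.0 - mu))
    return exponential_family_density(
        y, theta, 1.0, b=lambda t: math.log1p(math.exp(t)), c=lambda _y, _phi: 0.0
    )


# Probability of observing a count of 3 when the mean is 2.
probability_of_three = poisson_via_exponential_family(3, math.log(2.0))
```

Both functions reproduce the familiar probability functions, which is a quick
way to check that $\theta$, $b(\cdot)$ and $c(\cdot)$ have been matched up
correctly for a given response type.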

In Section 4.2.1 we unify conventional random coefficient and factor models,
leading to a ‘generalized factor’ (GF) formulation of the general model
described in Section 4.2.2. The GF formulation is multivariate; a matrix
expression specifies a vector of linear predictors for a multivariate response.
This multivariate formulation is useful for deriving covariance structures of
the observed responses (continuous case) or of the latent responses
(dichotomous, ordinal or comparative case).

The linear predictor can also be defined using the ‘generalized random
coefficient’ (GRC) formulation described in Section 4.2.3. This formulation is
univariate and resembles the univariate formulation of multilevel random
coefficient models described in Chapter 3. An important advantage of the GRC
formulation is that it includes separate terms for the parameters and
covariates, making the structure of the model more explicit than the GF
formulation. In Section 4.2.4, both the GF and GRC formulations are used to
specify a two-level factor model. Exploratory latent class models are specified using
both formulations in Section 4.2.5.

Table 4.1 Response types handled and corresponding application chapters

Response type                    Chapter

Continuous
Dichotomous                      Chapter 9
Ordinal                          Chapter 10
Counts                           Chapter 11
Durations                        Chapter 12
  Discrete time durations
  Continuous time durations
Comparative                      Chapter 13
  Nominal
  Rankings
  Pairwise comparisons
Mixed responses                  Chapter 14


4.2.1 Unifying conventional random coefficient and factor models

Conventional random coefficient and factor models, discussed in Sections 3.2 
and 3.3, are more similar than generally realized. Recall the random coefficient 
model from equation (3.14) 

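$$\mathbf{y}_j = \mathbf{X}_j\boldsymbol{\beta} + \mathbf{Z}_j\boldsymbol{\zeta}_j + \boldsymbol{\epsilon}_j, \qquad (4.1)$$

and the measurement part of the structural equation model in equation (3.34)

$$\mathbf{y}_j = (\boldsymbol{\nu} + \mathbf{K}\mathbf{x}_j) + \boldsymbol{\Lambda}\boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j, \qquad (4.2)$$

where some subscripts and superscripts have been omitted to simplify notation.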

Although different in interpretation, these models have a similar structure.
In the random coefficient model $\mathbf{y}_j$ represents the vector of
responses for the level-1 units within the $j$th level-2 unit whereas
$\mathbf{y}_j$ represents the items in the common factor model. To facilitate
the subsequent development, we will refer to the elementary units $i$ as
level-1 units whether they are the lowest level units in a multilevel setting
or items in a factor model. The clusters $j$ in random coefficient models or
units in common factor models are then level-2 units. The disturbances
$\boldsymbol{\epsilon}_j$ in random coefficient models correspond to unique
factors in common factor models, henceforth referred to as ‘errors’. The random
effects $\boldsymbol{\zeta}_j$ in random coefficient models correspond to the
common factors $\boldsymbol{\eta}_j$ in common factor models. We will use the
term latent variables, denoted $\boldsymbol{\eta}_j$, for either random effects
or common factors.

The design matrix $\mathbf{Z}_j$ for the random effects corresponds to the
factor loading matrix $\boldsymbol{\Lambda}$. There are two differences between
$\mathbf{Z}_j$ and $\boldsymbol{\Lambda}$. First, $\mathbf{Z}_j$ is a known
matrix of covariates and constants whereas $\boldsymbol{\Lambda}$ is an unknown
parameter matrix. Second, $\mathbf{Z}_j$ can differ between level-2 units $j$,
whereas $\boldsymbol{\Lambda}$ is generally constant. Nevertheless, we will
refer to both $\mathbf{Z}_j$ and $\boldsymbol{\Lambda}$ as the structure
matrix, denoted $\boldsymbol{\Lambda}_j$.

The fixed parts $\mathbf{X}_j\boldsymbol{\beta}$ and
$(\boldsymbol{\nu} + \mathbf{K}\mathbf{x}_j)$ in the two models can be used to
specify the same mean structure. In the case of a single covariate, the terms
for the $i$th row or $i$th level-1 unit are $\beta_0 + x_{ij}\beta$ and
$\nu_i + k_i x_j$, respectively. Whereas the former assumes a constant effect
$\beta$ of a level-1 specific covariate $x_{ij}$, the latter assumes level-1
specific effects $k_i$ of a level-2 specific covariate $x_j$. However, this
difference is superficial: in the random coefficient model, interactions with
dummy variables for the level-1 units $i$ can be used to allow coefficients to
depend on $i$; in the factor model, different covariates can be used for
different $i$ to represent a level-1 unit-specific covariate.

A response model unifying and generalizing both random coefficient and
factor models can now be written as

$$\mathbf{y}_j = \mathbf{X}_j\boldsymbol{\beta} + \boldsymbol{\Lambda}_j\boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j, \qquad (4.3)$$

where the structure matrix $\boldsymbol{\Lambda}_j$ can contain both variables
and parameters. The unifying notation and terminology is summarized in
Display 4.1.


Display 4.1 Unifying notation and terminology.

Unified model                            Random coefficient model                       Factor model
symbol  term                             symbol  interpretation                         symbol  interpretation

$i$                               level-1 units     $i$                               level-1 units    $i$                                            items
$j$                               level-2 units     $j$                               clusters         $j$                                            units
$\mathbf{y}_j$                    responses         $\mathbf{y}_j$                    responses        $\mathbf{y}_j$                                 responses
$\boldsymbol{\epsilon}_j$         errors            $\boldsymbol{\epsilon}_j$         disturbances     $\boldsymbol{\epsilon}_j$                      unique factors
$\boldsymbol{\eta}_j$             latent variables  $\boldsymbol{\zeta}_j$            random effects   $\boldsymbol{\eta}_j$                          common factors
$\boldsymbol{\Lambda}_j$          structure matrix  $\mathbf{Z}_j$                    design matrix    $\boldsymbol{\Lambda}$                         factor loading matrix
$\mathbf{X}_j\boldsymbol{\beta}$  fixed part        $\mathbf{X}_j\boldsymbol{\beta}$  fixed part       $(\boldsymbol{\nu} + \mathbf{K}\mathbf{x}_j)$  fixed part

© 2004 by Chapman & Hall/CRC


This model framework can also be used to formulate different kinds of latent
class models if the latent variables are discrete. A mixture regression model,
for instance a latent class growth model (see page 86), is obtained simply by
using discrete latent variables in a random coefficient model; see Section 9.5,
page 304, for an example in meta-analysis. Section 4.2.5 shows how exploratory
latent class models are formulated using this framework.

Note that treating the items of a factor model or the variables comprising
any multivariate response as level-1 units and the original units as level-2
clusters is a common approach in multivariate multilevel regression modeling
(e.g. Goldstein, 2003, Chapter 6). An advantage of this approach is that
missing responses then merely result in varying cluster sizes which can be
handled by multilevel modeling software if responses are missing at random
(MAR) (see Section 8.3.1 for types of missingness). The same approach is
adopted by Raudenbush and Sampson (1999ab) and Raudenbush and Bryk (2002) for
one-parameter item response models and De Boeck and Wilson (2004) for
two-parameter item response models.
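
The unified response model (4.3) can be sketched in a few lines. The code is
illustrative only: the helper names and all numerical values are hypothetical,
and the error term is omitted so that only the systematic part is computed. The
same function serves a random coefficient model, where the structure matrix is
a known design matrix, and a factor model, where it holds loadings that would
in practice be estimated.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def linear_predictor(fixed_design, beta, structure, eta):
    """Systematic part of (4.3): X_j beta + Lambda_j eta_j."""
    fixed = mat_vec(fixed_design, beta)
    random_part = mat_vec(structure, eta)
    return [f + r for f, r in zip(fixed, random_part)]


# Random coefficient model: the structure matrix is the known design matrix
# Z_j (ones for the random intercept, a covariate for the random slope).
z_j = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
x_j = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
random_coefficient = linear_predictor(x_j, [2.0, 0.5], z_j, [0.3, -0.1])

# One-factor model: the structure matrix holds factor loadings (hypothetical
# values); the fixed part gives item-specific intercepts via an identity matrix.
loadings = [[1.0], [0.8], [1.2]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
factor_model = linear_predictor(identity, [0.1, 0.2, 0.3], loadings, [0.5])
```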


4.2.2 Linear predictor in generalized factor (GF) formulation

The main advantage of considering the linear predictor is that all response 
processes considered in Chapter 2 are accommodated. 

The unified model in (4.3) can be written in generalized factor (GF)
formulation by writing the vector of linear predictors for the responses on
unit $j$ as

$$\boldsymbol{\nu}_j = \mathbf{X}_j\boldsymbol{\beta} + \boldsymbol{\Lambda}_j\boldsymbol{\eta}_j$$

and specifying an identity link and a normal density for $\mathbf{y}_j$ given
$\boldsymbol{\nu}_j$.

Before introducing the multilevel extension, we will reintroduce the
subscripts and superscripts for the levels of the model used in Section 3.2.1
to write the model as

$$\boldsymbol{\nu}_{j(2)} = \mathbf{X}_{j(2)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{j(2)}\boldsymbol{\eta}^{(2)}_{j}. \qquad (4.4)$$

Vectors with the $j(2)$ subscript contain all elements for the $j$th level-2
unit whereas latent variables with the (2) superscript vary at level 2. Note
that $\boldsymbol{\eta}_{j(2)} = \boldsymbol{\eta}^{(2)}_{j}$ for any two-level
model.

Display 4.2A on page 100 uses the GF formulation to represent the random
part of a single-level two-factor model for five items, where the first three
items load on factor 1 whereas the last two load on factor 2. Display 4.3A on
page 101 uses the same notation to represent the random part of a two-level
random coefficient model. Here there are three level-1 units in the $j$th
level-2 unit and the model includes a random intercept and a random slope of a
covariate $z_{ij}$.
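
For orientation, the two random parts just described can be written out as
follows. This is a sketch reconstructed from the verbal description only, not a
reproduction of Displays 4.2A and 4.3A; in particular the subscripts on the
loadings $\lambda_{im}$ and the symbols $\eta_{1j}$, $\eta_{2j}$ are
assumptions made for illustration.

```latex
% Single-level two-factor model for five items:
% items 1-3 load on factor 1, items 4-5 load on factor 2.
\boldsymbol{\Lambda}_{j}\boldsymbol{\eta}_{j} =
\begin{bmatrix}
\lambda_{11} & 0 \\
\lambda_{21} & 0 \\
\lambda_{31} & 0 \\
0 & \lambda_{42} \\
0 & \lambda_{52}
\end{bmatrix}
\begin{bmatrix} \eta_{1j} \\ \eta_{2j} \end{bmatrix}

% Two-level random coefficient model with three level-1 units:
% random intercept \eta_{1j} and random slope \eta_{2j} of the covariate z_{ij}.
\boldsymbol{\Lambda}_{j}\boldsymbol{\eta}_{j} =
\begin{bmatrix}
1 & z_{1j} \\
1 & z_{2j} \\
1 & z_{3j}
\end{bmatrix}
\begin{bmatrix} \eta_{1j} \\ \eta_{2j} \end{bmatrix}
```

In the first case the structure matrix contains only parameters and zeros, in
the second only known constants and covariate values.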

We can now generalize the model to $L$ levels as

$$\boldsymbol{\nu}_{z(L)} = \mathbf{X}_{z(L)}\boldsymbol{\beta} + \sum_{l=2}^{L} \boldsymbol{\Lambda}^{(l)}_{z(L)}\boldsymbol{\eta}^{(l)}_{z(L)}, \qquad (4.5)$$

where $\boldsymbol{\nu}_{z(L)}$ is the vector of linear predictors for all
units in a particular level-$L$ (top level) unit $z$ and
$\boldsymbol{\eta}^{(l)}_{z(L)}$ the vector of all (realizations of) the
level-$l$ latent variables for that level-$L$ unit (see bottom right-hand panel
of Display 4.4 on page 102). The $(l)$ superscript of
$\boldsymbol{\Lambda}^{(l)}_{z(L)}$ denotes that the matrix is specific to the
level-$l$ latent variables. As in Chapter 3, Display 3.2 on page 59, we can
alternatively write this model as a two-level model

$$\boldsymbol{\nu}_{z(L)} = \mathbf{X}_{z(L)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{z(L)}\boldsymbol{\eta}_{z(L)}, \qquad (4.6)$$

where $\boldsymbol{\Lambda}_{z(L)} = [\boldsymbol{\Lambda}^{(2)}_{z(L)}, \ldots, \boldsymbol{\Lambda}^{(L)}_{z(L)}]$
and $\boldsymbol{\eta}_{z(L)} = (\boldsymbol{\eta}^{(2)\prime}_{z(L)}, \ldots, \boldsymbol{\eta}^{(L)\prime}_{z(L)})'$
is the vector of all (realizations of all) latent variables for the $z$th
level-$L$ unit (see top right-hand panel of Display 4.4).


Display 4.2 Random part of a single-level two-factor model.

Display 4.3 Random part of a two-level random coefficient model (panels:
A. GF formulation; B. GRC formulation in matrix form; C. GRC formulation).
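
That the level-by-level form (4.5) and the stacked two-level form (4.6) give
the same linear predictor can be verified directly. The sketch below is a
hypothetical three-level example (four level-1 units nested in two level-2
units within one level-3 unit, with random intercepts at both levels); the
function names and values are assumptions for illustration.

```python
def mat_vec(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def level_by_level(structures, etas):
    """Random part of (4.5): sum over levels of Lambda^(l) eta^(l)."""
    total = [0.0] * len(structures[0])
    for structure, eta in zip(structures, etas):
        total = [t + v for t, v in zip(total, mat_vec(structure, eta))]
    return total


def stacked(structures, etas):
    """Random part of (4.6): [Lambda^(2), ..., Lambda^(L)] times the stacked eta."""
    n_rows = len(structures[0])
    wide = [sum((structure[r] for structure in structures), []) for r in range(n_rows)]
    long_eta = [value for eta in etas for value in eta]
    return mat_vec(wide, long_eta)


# Level 2: one random intercept per level-2 unit (two columns).
lambda_2 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
eta_2 = [0.4, -0.2]
# Level 3: a single random intercept shared by all four level-1 units.
lambda_3 = [[1.0], [1.0], [1.0], [1.0]]
eta_3 = [1.0]

summed = level_by_level([lambda_2, lambda_3], [eta_2, eta_3])
combined = stacked([lambda_2, lambda_3], [eta_2, eta_3])
```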


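4.2.3 Linear predictor in generalized random coefficient (GRC) formulation

For simplicity, we begin by considering a two-level model. In the GF
formulation (4.4), the structure matrix $\boldsymbol{\Lambda}_{j(2)}$ is
neither a pure design matrix nor a pure parameter matrix. Instead it contains
both variables and parameters. We can spell out the form of
$\boldsymbol{\Lambda}_{j(2)}\boldsymbol{\eta}^{(2)}_{j}$ by expanding it in
terms of pure design matrices $\mathbf{Z}^{(2)}_{mj}$ and pure parameter
vectors $\boldsymbol{\lambda}^{(2)}_{m}$ so that (4.4) becomes

$$\boldsymbol{\nu}_{j(2)} = \mathbf{X}_{j(2)}\boldsymbol{\beta} + \sum_{m=1}^{M} \eta^{(2)}_{mj}\mathbf{Z}^{(2)}_{mj}\boldsymbol{\lambda}^{(2)}_{m}, \qquad (4.7)$$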





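where $\eta^{(2)}_{mj}$ is the $m$th latent variable, $\mathbf{Z}^{(2)}_{mj}$
is an $n_j \times p^{(2)}_{m}$ (design) matrix of covariates and fixed known
constants and $\boldsymbol{\lambda}^{(2)}_{m}$ is a vector of parameters
associated with the $m$th latent variable. The product
$\mathbf{Z}^{(2)}_{mj}\boldsymbol{\lambda}^{(2)}_{m}$ represents the $m$th
column of $\boldsymbol{\Lambda}_{j(2)}$; $\boldsymbol{\lambda}^{(2)}_{m}$ is
therefore not a vector in the matrix $\boldsymbol{\Lambda}_{j(2)}$. We will
refer to this formulation as the ‘GRC formulation in matrix form’.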

A latent variable $\eta^{(2)}_{mj}$ can typically be interpreted as a factor if
all elements in the corresponding matrix $\mathbf{Z}^{(2)}_{mj}$ are zero or
one. In this case $\boldsymbol{\lambda}^{(2)}_{m}$ contains the $p^{(2)}_{m}$
nonzero factor loadings for that factor and the role of $\mathbf{Z}^{(2)}_{mj}$
is to assign the correct factor loadings to the different items. This can be
seen in Display 4.2B, [...] for factor models, but will be indexed
$\lambda^{(2)}_{mr}$ for the $r$th element of $\boldsymbol{\lambda}^{(2)}_{m}$
in the remainder of the book.

A latent variable $\eta^{(2)}_{mj}$ is a random coefficient of some variable
$z_{mij}$ if the corresponding matrix $\mathbf{Z}^{(2)}_{mj}$ is a column
vector with corresponding scalar ‘factor loading’ $\lambda^{(2)}_{m}$ set to 1.
This is illustrated in Display 4.3B.

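The $i$th row of (4.7) becomes

$$\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \sum_{m=1}^{M} \eta^{(2)}_{mj}\mathbf{z}^{(2)\prime}_{mij}\boldsymbol{\lambda}^{(2)}_{m}, \qquad (4.8)$$

where $\mathbf{x}_{ij}'$ is the $i$th row of $\mathbf{X}_{j(2)}$ and
$\mathbf{z}^{(2)\prime}_{mij}$ is the $i$th row of $\mathbf{Z}^{(2)}_{mj}$.
This is the GRC formulation for a two-level model.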

As shown in Display 4.2C on page 100, expressing factor models in this
notation requires dummy vectors $\mathbf{d}_{mi}$ with $p^{(2)}_{m}$ elements
(where $p^{(2)}_{m}$ is the number of items measuring or ‘loading on’ the $m$th
factor), equal to 1 for the element of $\boldsymbol{\lambda}^{(2)}_{m}$ that
represents the factor loading for item $i$ on factor $m$ and 0 otherwise.
Display 4.3C shows how a random coefficient model can be expressed using this
notation. Here $z^{(2)}_{mij}$ are scalars with corresponding parameters
$\lambda^{(2)}_{m} = 1$.
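
The role of the dummy vectors can be illustrated with the two-factor model for
five items mentioned earlier. The sketch below is an assumption-based
illustration (loadings, latent variable values and names are hypothetical): it
evaluates the random part of (4.8) item by item and checks that it agrees with
the GF formulation, where column $m$ of the structure matrix is
$\mathbf{Z}_{m}\boldsymbol{\lambda}_{m}$.

```python
def dot(left, right):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(left, right))


# Hypothetical loadings: items 1-3 load on factor 1, items 4-5 on factor 2.
lambda_vectors = [[1.0, 0.7, 1.3], [1.0, 0.9]]
eta = [0.5, -1.0]

# dummies[m][i] is the dummy vector for factor m and item i: it has a 1 in the
# position of lambda_vectors[m] holding item i's loading and zeros elsewhere.
dummies = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0], [0, 0], [0, 0], [1, 0], [0, 1]],
]


def grc_random_part(item):
    """Random part of (4.8) for one item: sum_m eta_m * (d_mi' lambda_m)."""
    return sum(
        eta[m] * dot(dummies[m][item], lambda_vectors[m]) for m in range(len(eta))
    )


# GF formulation: build the structure matrix column by column, then multiply.
structure = [
    [dot(dummies[m][i], lambda_vectors[m]) for m in range(2)] for i in range(5)
]
gf_random_part = [dot(row, eta) for row in structure]
grc = [grc_random_part(i) for i in range(5)]
```

Keeping the dummy vectors (known constants) separate from the loading vectors
(parameters) is what makes the GRC formulation convenient for software: only
`lambda_vectors` would need to be estimated.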

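The model can be extended to $L$ levels as

$$\nu = \mathbf{x}'\boldsymbol{\beta} + \sum_{l=2}^{L}\sum_{m=1}^{M_l} \eta^{(l)}_{m}\mathbf{z}^{(l)\prime}_{m}\boldsymbol{\lambda}^{(l)}_{m}, \qquad (4.9)$$

where $M_l$ is the number of latent variables at level $l$ and we have omitted
higher-level observation indices to simplify notation.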

See Rabe-Hesketh and Pickles (1999) and Rabe-Hesketh et al. (2000) for 
further examples of the GRC formulation. 


4.2.4 Example: Two-level factor model in GF and GRC formulation

We can now define a two-level factor model. Such a model is useful if the units
providing responses to the items are nested in clusters, for instance pupils in
schools. A single-level factor model would not be appropriate in this case since
responses from different units in the same cluster are likely to be correlated.
For continuous or latent responses, it is typically assumed that

$$\mathbf{y}_{jk} \mid \boldsymbol{\mu}_k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_1),$$

$$\boldsymbol{\mu}_k \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}_2),$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\mu}_k$ are vectors of intercepts.
Separate common factor models are then specified to structure the covariance
matrices $\boldsymbol{\Sigma}_1$ and $\boldsymbol{\Sigma}_2$ at the unit and
cluster levels (e.g. Longford and Muthén, 1992; Poon and Lee, 1992; Longford,
1993; Linda et al., 1993; Muthén, 1994; Lee and Shi, 2001). The common
factors at the cluster-level can then be interpreted as cluster-level constructs
which may have a different factor structure than unit-level constructs.

[...] an arrow pointing at the observed response $y$ represents a possibly
nonlinear relation, for instance a logit link function, in the diagrams
presented in this book. The short unlabelled arrows pointing at the observed
responses do not necessarily represent additive error terms. They could for
instance represent Poisson variability for counts.

The GF formulation of this model is shown in Display 4.5A on page 106 for
a level-3 unit $k$ with two level-2 units $j = 1, 2$. Note that the factor
loadings in the figure and display are labelled according to the GRC
formulation shown in Display 4.5B. For the $i$th item, each common factor is
multiplied by the dummy vector $\mathbf{d}_i$, with $i$th element equal to one
and all other elements equal to zero, to pick the $i$th factor loadings from
the level-2 and level-3 loading vectors. The $m$th unique factor is multiplied
by $d_{mi}$, equal to 1 if $m = i$ and 0 otherwise. Note that parameter
restrictions are necessary to identify the model, for example, setting some of
the level-2 and level-3 factor loadings to 1. See Section 5.2 for a detailed
discussion of identification. A simpler version of this model is discussed on
page 110; see also Figure 4.3(a).


4.2.5 Example: Exploratory latent class model in GF and GRC formulation

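We have so far implicitly assumed that the latent variables are continuous.
However, we can combine the same response model with discrete latent variables
to define latent class models. Here, two-level models are usually sufficient.
Let $\boldsymbol{\eta}^{(2)}_{j}$ take on discrete values $\mathbf{e}_c$ with
probabilities $\pi_c$, where we constrain the mean to zero,

$$\sum_{c} \pi_c \mathbf{e}_c = \mathbf{0}.$$

For an exploratory latent class model for dichotomous responses, the linear
predictor for item $i$, unit $j$ and class $c$ can be written as

$$\nu_{ijc} = \beta_i + e_{ic}, \qquad i = 1, \ldots, I.$$

Using the multivariate GF formulation, this becomes

$$\boldsymbol{\nu}_{j(2)} = \mathbf{X}_{j(2)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{j(2)}\boldsymbol{\eta}^{(2)}_{j}, \qquad (4.10)$$

where $\boldsymbol{\eta}^{(2)}_{j} = \mathbf{e}_c$, $c = 1, \ldots, C$, is an
$I$-dimensional latent variable and $\mathbf{X}_{j(2)} = \mathbf{I}$ and
$\boldsymbol{\Lambda}_{j(2)} = \mathbf{I}$ in this case, where $\mathbf{I}$ is
an $I$-dimensional identity matrix. Using the GRC formulation, the linear
predictor is

$$\nu_{ij} = \mathbf{d}_i'\boldsymbol{\beta} + \sum_{m=1}^{I} \eta^{(2)}_{mj} d_{mi}, \qquad (4.11)$$

where $\mathbf{d}_i'$ is the $i$th row of the $I$-dimensional unit matrix and
$d_{mi}$ is the $m$th element of $\mathbf{d}_i$, a dummy variable for $m = i$.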
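
An exploratory latent class model of this kind can be sketched numerically.
Everything in the example is hypothetical: the logit link is an assumed choice
(the text only says the responses are dichotomous), and the class locations
$\mathbf{e}_c$, probabilities $\pi_c$ and intercepts $\beta_i$ are made-up
values chosen so that the zero-mean constraint holds.

```python
import itertools
import math


def class_probabilities(beta, locations):
    """P(y_i = 1 | class c) under a logit link with nu_ic = beta_i + e_ic."""
    return [
        [1.0 / (1.0 + math.exp(-(b + e))) for b, e in zip(beta, location)]
        for location in locations
    ]


def pattern_probability(pattern, beta, locations, class_probs):
    """Marginal probability of a response pattern: sum_c pi_c prod_i P(y_i | c)."""
    conditional = class_probabilities(beta, locations)
    total = 0.0
    for pi_c, probs in zip(class_probs, conditional):
        likelihood = 1.0
        for y, p in zip(pattern, probs):
            likelihood *= p if y == 1 else 1.0 - p
        total += pi_c * likelihood
    return total


# Hypothetical two-class model for three items.
class_probs = [0.25, 0.75]
locations = [[1.5, 1.2, 0.9], [-0.5, -0.4, -0.3]]  # satisfies sum_c pi_c e_c = 0
beta = [0.2, -0.1, 0.4]

mean_location = [
    sum(p * e[i] for p, e in zip(class_probs, locations)) for i in range(3)
]
total_probability = sum(
    pattern_probability(pattern, beta, locations, class_probs)
    for pattern in itertools.product((0, 1), repeat=3)
)
```

Conditional independence given class membership is what allows the product over
items inside `pattern_probability`.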

In the case of polytomous or other comparative responses with $S_i$ categories
for item $i$, the linear predictor for unit $j$, item $i$, response category
$s$ and class $c$ is

$$\nu_{sijc} = \beta^{s}_{i} + e^{s}_{ic},$$

where $\beta^{1}_{i} = e^{1}_{ic} = 0$ for $i = 1, \ldots, I$,
$c = 1, \ldots, C$. We stack the linear predictors for the different response
categories and items into a single vector $\boldsymbol{\nu}_{j(2)}$. The
model can then be written as in (4.10) except that
$\boldsymbol{\eta}^{(2)}_{j}$ now has dimension $R = \sum_i S_i - I$ and the
identity matrices are replaced by $(R + I) \times R$ dimensional structure
matrices, equal to $(R + I) \times (R + I)$ dimensional identity matrices
with those $I$ columns removed that correspond to the first response categories
for each item. In the GRC formulation in (4.11), $\mathbf{d}_i'$ is then
replaced by the $i$th row of this structure matrix. We will use an exploratory
latent class model to analyze dichotomous diagnostic tests in Section 9.3 and
rankings of political values in Section 13.5.


4.2.6 Relaxing conditional independence


A basic assumption of the model framework is that the responses are
conditionally independent given the latent variables and covariates, an
assumption also known as ‘local independence’. While this may appear
restrictive, we can always generate more complex dependence structures by
including further latent variables. For example, in a common factor model we
can induce a correlation between two responses, conditional on the common
factor, by making the responses load on a further latent variable with factor
loadings set equal to 1. Similarly, Qu et al. (1996) relax the conditional
independence assumption in a latent class model by including a common factor
for all items with class-specific factor loadings.

To induce dependence between the residual $\epsilon_{1j}$ of a latent response
in a probit regression and the residual $\epsilon_{2j}$ of an observed response
in a linear regression (as in the famous Heckman selection model, e.g. Heckman,
1979), we can specify

$$\epsilon_{ij} = \lambda_i \eta^{(2)}_j + e_{ij}, \qquad \lambda_1 = 1,$$

where $\eta^{(2)}_j$ has variance $\psi$ and the level-1 residuals $e_{1j}$ and
$e_{2j}$ are independently normally distributed with zero means and variances
$\theta_{11}$ and $\theta_{22}$. Two further restrictions need to be imposed on
the four parameters ($\psi$, $\lambda_2$, $\theta_{11}$ and $\theta_{22}$) since
only the residual variance of the linear regression model,

$$\mathrm{Var}(\epsilon_{2j}) = \lambda_2^2 \psi + \theta_{22}, \tag{4.12}$$

and the correlation between the total residuals of the two models,

$$\mathrm{Cor}(\epsilon_{1j}, \epsilon_{2j}) = \frac{\lambda_2 \psi}{\sqrt{(\psi + \theta_{11})(\lambda_2^2 \psi + \theta_{22})}},$$

are identified (see Section 5.2 for a general treatment of identification). We
cannot set $\lambda_2$ to a constant because this would determine the sign of
the correlation. An obvious choice would therefore be to set $\theta_{11} = 1$
(as usual in probit regression) and $\psi = 1$. However, the correlation between
the total residuals of the two models then becomes

$$\mathrm{Cor}(\epsilon_{1j}, \epsilon_{2j}) = \frac{\lambda_2}{\sqrt{2(\lambda_2^2 + \theta_{22})}} \le \frac{1}{\sqrt{2}},$$

where the upper bound results if $\theta_{22}$ in the linear regression model is
zero. To avoid an upper bound (less than 1) for the correlation, we therefore
suggest the restrictions

$$\psi = 1, \qquad \theta_{11} = \theta_{22}.$$
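The effect of the two sets of restrictions can be checked numerically. The
following is a minimal sketch, assuming the residual structure
$\epsilon_{ij} = \lambda_i \eta_j + e_{ij}$ with $\lambda_1 = 1$ described above;
the function name and the parameter values are illustrative only.

```python
import math


def total_residual_correlation(lam2, psi, theta11, theta22):
    """Correlation of eps_1j = eta_j + e_1j and eps_2j = lam2 * eta_j + e_2j."""
    covariance = lam2 * psi
    var1 = psi + theta11
    var2 = lam2 ** 2 * psi + theta22
    return covariance / math.sqrt(var1 * var2)


bound = 1.0 / math.sqrt(2.0)

# Restrictions psi = 1, theta11 = 1: the correlation cannot exceed 1/sqrt(2),
# and reaches it only when theta22 = 0.
corr_at_bound = total_residual_correlation(50.0, 1.0, 1.0, 0.0)
grid_max = max(
    total_residual_correlation(lam2, 1.0, 1.0, theta22)
    for lam2 in (0.1, 1.0, 10.0, 100.0)
    for theta22 in (0.0, 0.5, 2.0)
)

# Restrictions psi = 1, theta11 = theta22: the correlation can approach 1.
corr_suggested = total_residual_correlation(1.0, 1.0, 0.01, 0.01)
```

Under the suggested restrictions the common residual variance can shrink
towards zero, which is what lets the correlation approach 1 in absolute value.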

For categorical responses, we can also relax conditional independence without
including further latent variables in the model. Consider two dichotomous
responses $y_1$ and $y_2$. We can treat the four possible response patterns
(0,0), (1,0), (0,1) and (1,1) as a single multinomial response and model the
probabilities as

$$\Pr(y_1, y_2) = \frac{\exp(\beta_1 y_1 + \beta_2 y_2 + \beta_{12} y_1 y_2)}{\sum_{z_1 = 0,1} \sum_{z_2 = 0,1} \exp(\beta_1 z_1 + \beta_2 z_2 + \beta_{12} z_1 z_2)}.$$

Extra dependence, in addition to that induced by latent variables, results
if $\beta_{12} \neq 0$. This way of introducing ‘local’ dependence has been
suggested for latent class models by Harper (1972) and Hagenaars (1988) among
others, but is generally applicable to latent variable models with categorical
responses. For instance, in item response modeling, local dependence among a
group of items is sometimes accommodated by combining the items into a single
response called ‘testlet’ (Wainer and Kiely, 1987) or ‘item bundle’ (Wilson
and Adams, 1995).
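As a small illustration of the multinomial formulation for two dichotomous
responses, the sketch below computes the four pattern probabilities and the odds
ratio between the responses, which equals $\exp(\beta_{12})$. The function names
and coefficient values are hypothetical.

```python
import math
from itertools import product


def pattern_probabilities(beta1, beta2, beta12):
    """Probabilities of the patterns (y1, y2) in {0, 1} x {0, 1}."""
    weights = {
        (y1, y2): math.exp(beta1 * y1 + beta2 * y2 + beta12 * y1 * y2)
        for y1, y2 in product((0, 1), repeat=2)
    }
    total = sum(weights.values())
    return {pattern: weight / total for pattern, weight in weights.items()}


def odds_ratio(probs):
    """Cross-product ratio of the 2 x 2 table of pattern probabilities."""
    return (probs[(1, 1)] * probs[(0, 0)]) / (probs[(1, 0)] * probs[(0, 1)])


independent = pattern_probabilities(0.3, -0.5, 0.0)   # beta12 = 0
dependent = pattern_probabilities(0.3, -0.5, 1.2)     # beta12 != 0
```

With $\beta_{12} = 0$ the odds ratio is 1, so the two responses are independent
given whatever enters $\beta_1$ and $\beta_2$.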


4.3 Structural model for the latent variables 

4.3.1 Continuous latent variables

In order to define the structural model, we first define the latent variable
vector
$\boldsymbol{\eta}_{jk\ldots z} = (\boldsymbol{\eta}^{(2)\prime}_{jk\ldots z}, \boldsymbol{\eta}^{(3)\prime}_{k\ldots z}, \ldots, \boldsymbol{\eta}^{(L)\prime}_{z})'$,
containing all latent variables for the $j$th level-2 unit, where $k \ldots z$
are the indices for units at levels 3 to $L$. The vector could also be denoted
$\boldsymbol{\eta}_{jk\ldots z(2)}$ or simply $\boldsymbol{\eta}_j$. For a
three-level model, $\boldsymbol{\eta}_{jk}$ is shown for $j = 2$ in the top left
panel of Display 4.4 on page 102. The structural model for the latent variables
has the form (omitting higher-level subscripts)

$$\boldsymbol{\eta}_j = \mathbf{B}\boldsymbol{\eta}_j + \boldsymbol{\Gamma}\mathbf{w}_j + \boldsymbol{\zeta}_j, \tag{4.13}$$

where $\mathbf{B}$ is an $M \times M$ matrix of regression parameters,
$M = \sum_l M_l$, $\mathbf{w}_j$ is a vector of $R$ covariates,
$\boldsymbol{\Gamma}$ is an $M \times R$ matrix of regression parameters and
$\boldsymbol{\zeta}_j$ is a vector of $M$ errors or disturbances. This model is
essentially a generalization of the conventional single-level structural model
(e.g. Muthén, 1984) to a multilevel setting. The crucial difference is that
latent and observed variables may vary at different levels in our framework.
Each element of $\boldsymbol{\zeta}_j$ varies at the same level as the
corresponding element of $\boldsymbol{\eta}_j$.
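It would not make sense to regress a higher level latent variable on a lower
level latent or observed variable since this would force the higher level
variable to vary at a lower level. In terms of the blocks of $\mathbf{B}$
corresponding to the vectors of latent variables at each level,
$\boldsymbol{\eta}^{(l)}$, the matrix $\mathbf{B}$ is therefore upper
block-diagonal. Similarly, if the covariate vector is written as
$\mathbf{w}_j = (\mathbf{w}^{(2)\prime}_{jk\ldots z}, \mathbf{w}^{(3)\prime}_{k\ldots z}, \ldots, \mathbf{w}^{(L)\prime}_{z})'$,
the matrix $\boldsymbol{\Gamma}$ is upper block-diagonal:

$$
\begin{bmatrix}
\boldsymbol{\eta}^{(2)}_{jk\ldots z} \\ \boldsymbol{\eta}^{(3)}_{k\ldots z} \\ \vdots \\ \boldsymbol{\eta}^{(L)}_{z}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{B}^{(22)} & \mathbf{B}^{(23)} & \cdots & \mathbf{B}^{(2L)} \\
\mathbf{0} & \mathbf{B}^{(33)} & \cdots & \mathbf{B}^{(3L)} \\
\vdots & & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{B}^{(LL)}
\end{bmatrix}
\begin{bmatrix}
\boldsymbol{\eta}^{(2)}_{jk\ldots z} \\ \boldsymbol{\eta}^{(3)}_{k\ldots z} \\ \vdots \\ \boldsymbol{\eta}^{(L)}_{z}
\end{bmatrix}
+
\begin{bmatrix}
\boldsymbol{\Gamma}^{(22)} & \boldsymbol{\Gamma}^{(23)} & \cdots & \boldsymbol{\Gamma}^{(2L)} \\
\mathbf{0} & \boldsymbol{\Gamma}^{(33)} & \cdots & \boldsymbol{\Gamma}^{(3L)} \\
\vdots & & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Gamma}^{(LL)}
\end{bmatrix}
\begin{bmatrix}
\mathbf{w}^{(2)}_{jk\ldots z} \\ \mathbf{w}^{(3)}_{k\ldots z} \\ \vdots \\ \mathbf{w}^{(L)}_{z}
\end{bmatrix}
+
\begin{bmatrix}
\boldsymbol{\zeta}^{(2)}_{jk\ldots z} \\ \boldsymbol{\zeta}^{(3)}_{k\ldots z} \\ \vdots \\ \boldsymbol{\zeta}^{(L)}_{z}
\end{bmatrix}.
\tag{4.14}
$$

Block $\mathbf{B}^{(ab)}$ contains regression parameters for the regressions of
$\boldsymbol{\eta}^{(a)}$ on $\boldsymbol{\eta}^{(b)}$ and similarly for
$\boldsymbol{\Gamma}^{(ab)}$. Note, however, that unlike $\boldsymbol{\eta}_j$,
$\mathbf{w}_j$ need not contain subvectors for each level. There may for example
be a single covariate at level $L$ ($R = 1$, $\mathbf{w}_j = w^{(L)}_z$). We will
henceforth omit the superscript from the covariates $\mathbf{w}$.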

The model becomes easier to estimate, and easier to understand, if the
regressions among latent variables at a particular level are recursive. In this
case the elements of $\boldsymbol{\eta}^{(l)}$ can be permuted in such a way
that the blocks $\mathbf{B}^{(aa)}$ on the diagonal are strictly upper diagonal.
The expression for $\eta_M$ can then be substituted into the expressions for
$\eta_1$ to $\eta_{M-1}$, the expression for $\eta_{M-1}$ into the regressions
for $\eta_1$ to $\eta_{M-2}$, etc., until all $\eta$ are eliminated from the
right-hand side of the equation. Substituting the final expressions into the
linear predictor then yields what we will call the reduced form for the latent
variables (see Section 4.6), where the only latent variables remaining on the
right-hand side are the disturbances $\boldsymbol{\zeta}$. An implication of
restricting the relations to be recursive is that we cannot have simultaneous
effects with a particular latent variable regressed on another and vice versa.
However, such models are rarely used in practice, possibly due to a combination
of conceptual complexity and identification restrictions that are often deemed
unpalatable.
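The successive substitution described above can be sketched in a few lines. This
is an illustration under stated assumptions rather than part of the original
text: $\mathbf{B}$ is taken to be strictly upper triangular, the vector `u`
stands in for $\boldsymbol{\Gamma}\mathbf{w}_j + \boldsymbol{\zeta}_j$, and all
numbers are made up.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def reduced_form_multiplier(B):
    """Return (I - B)^{-1} as I + B + ... + B^(M-1).

    The finite sum is exact only because B is strictly upper triangular,
    so that B^M = 0.
    """
    n = len(B)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total, power = identity, identity
    for _ in range(n - 1):
        power = matmul(power, B)
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total


def solve_by_substitution(B, u):
    """Solve eta = B eta + u by substituting eta_M, eta_{M-1}, ... in turn."""
    n = len(B)
    eta = [0.0] * n
    for m in reversed(range(n)):
        eta[m] = u[m] + sum(B[m][k] * eta[k] for k in range(m + 1, n))
    return eta


B = [[0.0, 0.5, 0.3],
     [0.0, 0.0, 0.2],
     [0.0, 0.0, 0.0]]
u = [1.0, 2.0, 3.0]  # stands in for Gamma w_j + zeta_j

eta_substituted = solve_by_substitution(B, u)
multiplier = reduced_form_multiplier(B)
eta_reduced = [row[0] for row in matmul(multiplier, [[x] for x in u])]
```

Both routes give the same latent variable values, which is the sense in which
the recursive model has a simple reduced form.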


Examples 

An example of a structural model involving two latent variables at level 2 and
one latent variable at level 3 is given by

which is shown in path diagram form in Figure 4.2. Here $\eta^{(2)}_{1jk}$ is
regressed both on a same-level latent variable $\eta^{(2)}_{2jk}$ and a
higher-level latent variable $\eta^{(3)}_{1k}$, as well as an observed covariate
$w_{jk}$ varying at level 2. Reversing the path $b_{13}$ to $b_{31}$ would not
make sense since $\eta^{(3)}_{1k}$ would then be forced to vary at level 2.
Adding a path $b_{21}$ from $\eta^{(2)}_{1jk}$ to $\eta^{(2)}_{2jk}$ would render
the relations at level 2 nonrecursive.

Figure 4.2 Example of a multilevel structural equation model
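For orientation, one matrix form that is consistent with the verbal description
of this example is sketched below. The paths $b_{12}$ and $b_{13}$ are named in
the text; the covariate coefficient $\gamma_{11}$ and the disturbance labels are
assumptions introduced here for illustration.

```latex
\begin{bmatrix} \eta^{(2)}_{1jk} \\ \eta^{(2)}_{2jk} \\ \eta^{(3)}_{1k} \end{bmatrix}
=
\begin{bmatrix} 0 & b_{12} & b_{13} \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} \eta^{(2)}_{1jk} \\ \eta^{(2)}_{2jk} \\ \eta^{(3)}_{1k} \end{bmatrix}
+
\begin{bmatrix} \gamma_{11} \\ 0 \\ 0 \end{bmatrix} w_{jk}
+
\begin{bmatrix} \zeta^{(2)}_{1jk} \\ \zeta^{(2)}_{2jk} \\ \zeta^{(3)}_{1k} \end{bmatrix}
```

In this form $\mathbf{B}$ is strictly upper triangular, so the relations are
recursive; a nonzero $b_{21}$ or $b_{31}$ would fall below the diagonal.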

We will now consider an alternative to the two-level factor model considered in
Section 4.2.4, which is shown again in Figure 4.3(b). As illustrated in
Figure 4.3(a), we simply retain the level-2 model (inside the inner box) and
allow the level-2 factor to vary at level 3 by adding a regression of the level-2
factor on the level-3 factor. Such a model is referred to as a variance
components factor model in Rabe-Hesketh et al. (2004a). The model is arguably
much easier to interpret than the less structured alternative in Figure 4.3(b).
The common factor, defined through its relationship to the observed items
at the unit level, simply has a component of variation at the cluster level.
The model is analogous to a MIMIC model, with the crucial difference that
the common factor is regressed on a latent variable varying at a higher level
instead of an observed variable. Including unique factors at the higher level
is analogous to including direct effects in a MIMIC model. An advantage of
the variance components factor model is that it is easier to incorporate within
a structural equation model than the general two-level factor model. In fact,
such a model formed part of the previous example in Figure 4.2, where
$\eta^{(2)}_{1jk}$ is the common factor at level 2 and $\eta^{(3)}_{1k}$ is the
variance component at level 3. A variance components factor model is used to
analyze attitudes to abortion in Section 9.8.

Figure 4.3 (a) A variance components factor model and (b) a general two-level
factor model

Structural models that are nonlinear in the latent variables have also been
proposed (e.g. Busemeyer and Jones, 1983; Kenny and Judd, 1984). Arminger
and Muthén (1998) discuss a nonlinear version of the LISREL model for
continuous responses (see Section 3.5). The structural model for latent response
variables $\boldsymbol{\eta}_j$ in terms of the latent explanatory variables
$\boldsymbol{\xi}_j$ is given by

$$\boldsymbol{\eta}_j = \mathbf{B}\boldsymbol{\eta}_j + \boldsymbol{\Gamma}\boldsymbol{\alpha}_j + \boldsymbol{\zeta}_j, \qquad \boldsymbol{\alpha}_j = \mathbf{g}(\boldsymbol{\xi}_j).$$

Here, $\mathbf{g}(\boldsymbol{\xi}_j)$ is a known deterministic vector function
and normality is assumed for $\boldsymbol{\xi}_j$ and $\boldsymbol{\zeta}_j$.
Special cases include polynomial regression models where

$$\boldsymbol{\alpha}_j = [\xi_j, \xi_j^2, \ldots, \xi_j^p]',$$

and (first order) interaction models where

$$\boldsymbol{\alpha}_j = [\xi_{1j}, \xi_{2j}, \ldots, \xi_{qj}, \xi_{1j}\xi_{2j}, \xi_{1j}\xi_{3j}, \ldots, \xi_{q-1,j}\xi_{qj}]'.$$

Setting $\boldsymbol{\eta}_j = \boldsymbol{\alpha}_j$ produces a nonlinear factor
model. We refer to Jöreskog (1998) for an overview of nonlinear structural
equation modeling; see also other contributions in Schumacker and Marcoulides
(1998).
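The two special cases of the deterministic function $\mathbf{g}$ can be written
out directly. The sketch below is illustrative only: the function names are
hypothetical and the polynomial degree is passed in as an argument.

```python
from itertools import combinations


def polynomial_terms(xi, degree):
    """Powers xi, xi^2, ..., xi^degree of a single latent variable."""
    return [xi ** power for power in range(1, degree + 1)]


def interaction_terms(xi_values):
    """Main effects followed by all first-order (pairwise) products."""
    main = list(xi_values)
    pairwise = [a * b for a, b in combinations(xi_values, 2)]
    return main + pairwise


alpha_polynomial = polynomial_terms(2.0, 3)
alpha_interaction = interaction_terms([1.0, 2.0, 3.0])
```

For $q$ latent explanatory variables the interaction vector has
$q + q(q-1)/2$ elements, ordered as in the display above.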


4.3.2 Discrete latent variables

For discrete latent variables, the structural model is the model for the (prior) 
probabilities that the units belong to the corresponding latent classes. For a 










This probability may depend on covariates $\mathbf{v}_j$ through a multinomial
logit model

$$\pi_{jc} = \frac{\exp(\mathbf{v}_j'\boldsymbol{\varrho}^{c})}{\sum_d \exp(\mathbf{v}_j'\boldsymbol{\varrho}^{d})}, \tag{4.15}$$

where $\boldsymbol{\varrho}^{c}$ are regression parameters, with the parameters
for one class set to zero for identification. Such a ‘concomitant variable’
latent class model is used for instance by Dayton and Macready (1988) and
Formann (1992). The multinomial logit parametrization is useful even if the
class membership does not depend on covariates since it forces latent class
probabilities to sum to one.
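A minimal sketch of the multinomial logit model for class membership follows.
It assumes, for illustration, that the first class is the reference class with
all coefficients zero; the function name and the numbers are hypothetical.

```python
import math


def class_probabilities(v, rho):
    """Class probabilities from covariates v and per-class coefficients rho.

    v includes the constant; rho[c] is the coefficient vector for class c.
    """
    linear = [sum(vk * rk for vk, rk in zip(v, rho_c)) for rho_c in rho]
    shift = max(linear)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(value - shift) for value in linear]
    total = sum(weights)
    return [weight / total for weight in weights]


v_j = [1.0, 0.5]                      # constant and one covariate
rho = [[0.0, 0.0],                    # reference class (assumed to be class 1)
       [0.4, -1.0],
       [-0.2, 0.8]]
pi_j = class_probabilities(v_j, rho)
```

Because the probabilities are normalized, they sum to one whatever values the
free coefficients take, which is the property noted in the text.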

It is sometimes useful to use the following constraint for the locations. Let
$\pi_c$ denote the probabilities when the covariates $\mathbf{v}_j$ are zero
(except for the constant). Then the $e_c$ for $c = 1,\ldots,C-1$ are free
parameters and $e_C$ is determined by setting the mean location to zero,

$$\sum_{c=1}^{C} \pi_c e_c = 0.$$

An advantage of this parametrization is that the mean structure can be
specified in the fixed part of the model $\mathbf{x}'\boldsymbol{\beta}$ as in
continuous latent variable models.
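The zero-mean constraint determines the last location from the others. A short
sketch, with a hypothetical function name and made-up values:

```python
def complete_locations(free_locations, probabilities):
    """Append e_C so that sum_c pi_c * e_c = 0.

    free_locations holds e_1, ..., e_{C-1}; probabilities holds pi_1, ..., pi_C.
    """
    weighted = sum(p * e for p, e in zip(probabilities, free_locations))
    last = -weighted / probabilities[-1]
    return list(free_locations) + [last]


pi = [0.2, 0.5, 0.3]
e = complete_locations([-1.5, 0.0], pi)
mean_location = sum(p * loc for p, loc in zip(pi, e))
```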

If the latent classes are ordered along a dimension, as in a discrete
one-factor model, ordinal models can be specified for the latent class
probabilities either by constraining parameters in (4.15) or using cumulative
models (see Section 2.3.4).

If there are several discrete latent variables, we can either use model (4.15)
where $c$ labels the combinations of categories for the latent variables, or we
can parameterize the model as a log-linear model with main effects and
interactions of the latent variables. Hagenaars (1993) and Vermunt (1997)
considered regressions of discrete latent variables on other discrete latent
variables at the same level. Vermunt (2003) extends the structural model to
include higher level continuous or discrete latent variables in the linear
predictor of (4.15); see also Section 4.4.4.


4.4 Distribution of the disturbances 

To complete model specification we must specify the distribution of the
disturbances $\boldsymbol{\zeta}$ in the structural model. If there is no
structural model the latent variables simply equal the disturbances,
$\boldsymbol{\eta} = \boldsymbol{\zeta}$.

The dependence structure of the disturbances is specified by the number of
levels $L$ and the number of disturbances $M_l$ at each level. A particular
level may coincide with a level of clustering in the hierarchical dataset.
However, there will often not be a direct correspondence between the levels of
the model and the levels of the data hierarchy. For instance, in factor models
items were treated as units at level 1 and subjects as units at level 2.

The terms ‘unit at a level’ and ‘disturbance at a level’ are defined as follows:
• a unit at level 1 is an elementary unit of observation,
• a unit $k$ at level $l > 1$ is a cluster of level-1 units,
• the level-1 units in cluster $k$ at level $l > 1$ fall into $n^{(l-1)}_k$
  subsets representing units at level $l - 1$,
• a disturbance $\zeta^{(l)}$ at level $l$ varies between the units at level $l$
  but not within them,
• the units at level $l$ are conditionally independent given the disturbances at
  levels $l + 1$ and above and any explanatory variables.

The basic assumption is that disturbances at the same level may be dependent,
whereas disturbances at different levels are independent. In the following
subsections we describe different specifications of the distribution of
$\boldsymbol{\zeta}$.

4.4.1 Continuous distributions

In the case of continuous disturbances, the predominant distributional
assumption is multivariate normality with mean zero and covariance matrix
$\boldsymbol{\Psi}^{(l)}$ at level $l$. An advantage of this distribution is
that the means, variances and covariances are explicitly parameterized and can
be freely specified. Importantly, the likelihood cannot be expressed in closed
form in this case unless the responses are conditionally normally distributed.
However, as discussed in Section 6.2, closed form expressions exist for some
combinations of latent variable and response distributions in the case of simple
random intercept models with between-cluster covariates only. Wedel and Kamakura
(2001) discuss factor models with independent factors having any distribution
from the exponential family, although these models generally do not have closed
form likelihoods.

Fortunately, in many cases inferences appear to be surprisingly robust to
departures from normal disturbances (e.g. Bartholomew, 1988, 1994; Seong,
1990; Kirisci and Hsu, 2001; Wedel and Kamakura, 2001). Several attempts
have nevertheless been made to ‘robustify’ the disturbance distribution.
Pinheiro et al. (2001) consider multivariate $t$-distributions and find these
more robust against outliers than the multivariate normal.

In order to avoid making strong assumptions about the distribution of the
disturbances, flexible parametric distributions can be used such as finite
mixtures of (multivariate) normal distributions (e.g. Uebersax, 1993; Uebersax
and Grove, 1993; Magder and Zeger, 1996; Verbeke and Lesaffre, 1996; Allenby
et al., 1998; Carroll et al., 1999; Lenk and DeSarbo, 2000; Richardson et al.,
2002). In some cases the components of the finite mixture are interpreted as
subpopulations, for instance those with or without a disease in Qu et al.
(1996); see also Section 4.4.4. In the Bayesian setting there has recently
been considerable interest in modeling disturbances via semiparametric mixtures
of Dirichlet processes (e.g. Müller and Roeder, 1997; Chib and Hamilton, 2002).

Another approach is to use a truncated Hermite series expansion as suggested by
Gallant and Nychka (1987) and Davidian and Gallant (1992).

4.4.2 Nonparametric distributions

Instead of making distributional assumptions regarding the disturbances, we
can use ‘nonparametric maximum likelihood estimation’ (NPMLE) (Laird,
1978). Principally in the context of random intercept models, Simar (1976)
and Laird (1978), and more generally Lindsay (1983), have shown that the
nonparametric maximum likelihood estimator (NPMLE) of the unspecified
(possibly continuous) distribution is a discrete distribution with nonzero
probabilities $\pi_c$ at a finite set of locations $e_c$, $c = 1,\ldots,C$, as
shown in the upper panel of Figure 4.4 (see also Lindsay et al., 1991; Aitkin,
1996, 1999a; Rabe-Hesketh et al., 2003a). For this reason the model is often
referred to as a semiparametric mixture model. The cumulative distribution
function of the disturbance is a step function as shown in the lower panel of
Figure 4.4.
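The step-function form of the cumulative distribution can be made concrete with
a few lines of code. This is a sketch with made-up locations and probabilities,
not estimates from any dataset.

```python
def mass_point_cdf(locations, probabilities, x):
    """Cumulative probability of a discrete mass-point distribution at x."""
    return sum(p for e, p in zip(locations, probabilities) if e <= x)


locations = [-1.0, 0.5, 2.0]       # e_c
probabilities = [0.3, 0.5, 0.2]    # pi_c

cdf_values = [mass_point_cdf(locations, probabilities, x)
              for x in (-2.0, -1.0, 0.0, 0.5, 1.0, 2.0)]
```

The function is flat between locations and jumps by $\pi_c$ at each $e_c$.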

For multivariate disturbances with $M$ elements, the masses are located at
points $\mathbf{e}_c$ in $M$ dimensions (e.g. Davies and Pickles, 1987; Aitkin,
1999a). See Section 9.5, page 304, for an example with $M = 2$, where the
two-dimensional mass-point distribution is displayed in two different ways in
Figure 9.7. Vermunt (2004) describes NPMLE for three-level models.

For a given number of masses, the locations and probabilities can be estimated
jointly with the other parameters using maximum likelihood estimation. The
number of masses can then be increased until the largest maximized likelihood
is achieved. Alternatively, a model with a very large number of masses can be
estimated so that redundant masses will either merge with other masses (sharing
the same location) or have zero probabilities. A method for determining if a
given $C$ corresponds to the NPMLE, based on the directional derivative, is
discussed in Section 6.5. Maximum likelihood theory for NPMLE is reviewed by
Lindsay (1995) and Böhning (2000). Heinen (1996) denotes NPMLE ‘fully
semiparametric’ and refers to the simpler approach where masses are estimated
but locations fixed a priori as ‘semiparametric’.

An important advantage of NPMLE is that it is appropriate regardless of the
disturbance distribution. The true distribution could be continuous (normal
or nonnormal), discrete or continuous with discrete components. Relying on
NPMLE, we can concentrate on the specification of other model components
and need not worry about the nature of the disturbance distribution. We use
NPMLE for instance in Sections 9.5, 11.2, 11.3.3 and 14.2.

4.4.3 Discrete distributions

If the latent variables are discrete, we model their distribution using
multinomial logit models as discussed in Section 4.3.2.


4.4.4 Mixed continuous and discrete distributions

Models with both continuous and discrete latent variables can take different
forms.

Figure 4.4 Discrete distribution and cumulative distribution

The first includes both types of latent variable in the response model. In a
model for rankings Böckenholt (2001b) includes a discrete alternative-specific
random intercept as well as continuous common factors and random coefficients.
Similarly, McCulloch et al. (2002) specify a ‘latent class mixed model’
with both discrete and continuous random coefficients for joint modeling of
continuous longitudinal responses and survival (see Section 14.6 for a similar
application). The latent classes are interpreted as subpopulations of men
differing both in their mean trajectories of (log) prostate specific antigen and
in their time to onset of prostate cancer. Variability among men within the
same latent class is accommodated by the (continuous) random effects. Both
Böckenholt (2001b) and McCulloch et al. (2002) treat the discrete and
continuous latent variables as independent of each other. Note that the sum of
a discrete and continuous (zero mean, normally distributed) latent variable
is just a finite mixture of normal densities with equal variances, the discrete
variable representing the component means.

The second kind of model has only discrete latent variables in the response
model and continuous latent variables in the structural model. Such a model was
proposed by Vermunt (2003) in the multilevel setting. The item-level model is a
conditional response model for item $i$, unit $j$, cluster $k$, given class
membership $c$. For unordered categorical responses with categories
$s = 1,\ldots,S$, the model can be written as

$$\Pr(y_{ijk} = s \mid \boldsymbol{\eta}^{(2)}_{jk} = \mathbf{e}_c) = \frac{\exp(\nu^{s}_{ijkc})}{\sum_{t=1}^{S} \exp(\nu^{t}_{ijkc})}, \tag{4.16}$$

where $\nu^{s}_{ijkc}$ is the linear predictor for category $s$. The unit-level
model is a multinomial logit model for class membership,

$$\Pr(\boldsymbol{\eta}^{(2)}_{jk} = \mathbf{e}_c) = \frac{\exp(\alpha^{c}_{jk})}{\sum_{b} \exp(\alpha^{b}_{jk})},$$

where the linear predictor $\alpha^{c}_{jk}$ of the structural model includes a
cluster-level random intercept,

$$\alpha^{c}_{jk} = \mathbf{v}_{jk}'\boldsymbol{\varrho}^{c} + \eta^{(3)}_{ck}.$$

Here, $\mathbf{v}_{jk}$ are unit- and cluster-specific covariates with fixed
class-specific coefficients $\boldsymbol{\varrho}^{c}$. A normal distribution is
specified for the cluster-level random intercept. Vermunt remarks that it is
often useful to assume that the conditional response probabilities do not depend
on the clusters (by dropping the $k$ subscript in (4.16)). He also points out
that the cluster-level random intercept can be specified as discrete.

Third, the response model could contain only continuous latent variables
whereas discrete latent variables appear in the structural model. The
structural model, where continuous latent variables are regressed on discrete
latent variables, is usually more complex than the conventional structural
models (only including continuous latent variables). For instance, the
covariance matrix of the disturbances may depend on the discrete latent
variables. The most common structural model is a finite mixture of multivariate
normal distributions

$$\sum_{c=1}^{C} \pi_c h_c(\boldsymbol{\zeta}),$$

where $c$ indexes the components, $\pi_c$ are the component weights and
$h_c(\boldsymbol{\zeta})$ is a multivariate normal density with
component-specific mean and covariance parameters. Such a model was used by
Verbeke and Lesaffre (1996), Allenby et al. (1998) and Lenk and DeSarbo (2000)
for random coefficient models, by Magder and Zeger (1996), Carroll et al. (1999)
and Richardson et al. (2002) for covariate measurement error and Uebersax (1993)
and Uebersax and Grove (1993) for measurement models with dichotomous and
ordinal responses.
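To make the finite mixture concrete, the sketch below evaluates a mixture of
normal densities and its first two moments. For brevity it treats a single
disturbance (the univariate case) rather than the multivariate densities
discussed above; names and values are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Finite mixture density sum_c pi_c * h_c(x)."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(weights, means, sds))


def mixture_moments(weights, means, sds):
    """Mean and variance of the mixture distribution."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s ** 2 + m ** 2)
                 for w, m, s in zip(weights, means, sds))
    return mean, second - mean ** 2


weights, means, sds = [0.6, 0.4], [-1.0, 1.5], [1.0, 1.0]
mix_mean, mix_var = mixture_moments(weights, means, sds)

# Crude Riemann sum as a sanity check that the density integrates to one.
step = 0.01
area = sum(mixture_pdf(-10.0 + i * step, weights, means, sds) * step
           for i in range(2001))
```

With equal component standard deviations, as here, the mixture is the
distribution of a discrete location plus an independent zero-mean normal
variable.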

Finally, the most general model allows any of the parameters of conventional
structural equation models with continuous responses to depend on discrete
latent variables. Both the response model and structural model can therefore
differ between latent classes, giving a multiple-group structural equation
model of the kind proposed by Jöreskog (1971a), with the crucial difference
that group membership is unknown.

Yung (1997) and Fokoué and Titterington (2003), among others, consider
the special case of finite mixture factor models. Yung’s model can be written
as

$$\mathbf{y}_{jc} = \boldsymbol{\beta}_c + \boldsymbol{\Lambda}_c \boldsymbol{\eta}_{jc} + \boldsymbol{\epsilon}_{jc}, \tag{4.17}$$

where $\boldsymbol{\eta}_{jc}$ are continuous common factors with
class-specific variances

$$\boldsymbol{\Psi}_c = \mathrm{Var}[\boldsymbol{\eta}_{jc}],$$

and the unique factors have class-specific covariance matrices,

$$\boldsymbol{\Theta}_c = \mathrm{Var}[\boldsymbol{\epsilon}_{jc}].$$

Fokoué and Titterington assume that $\boldsymbol{\Theta}_c = \boldsymbol{\Theta}$
and $\boldsymbol{\Psi}_c = \mathbf{I}$. In the context of diagnostic test
agreement, Qu et al. (1996) specify a unidimensional probit version of this
model. They interpret the model as a latent class model with a ‘random effect’
(the common factor) to relax conditional independence.

Blåfield (1980), Jedidi et al. (1997), Dolan and van der Maas (1998), Arminger
et al. (1999), McLachlan and Peel (2000), Wedel and Kamakura (2000), Muthén
(2002) and others specify ‘finite mixture structural equation models’ by
including a structural model

$$\boldsymbol{\eta}_{jc} = \mathbf{B}_c \boldsymbol{\eta}_{jc} + \boldsymbol{\Gamma}_c \mathbf{w}_{jc} + \boldsymbol{\zeta}_{jc}, \qquad \boldsymbol{\Psi}_c = \mathrm{Var}[\boldsymbol{\zeta}_{jc}],$$

for the factors in (4.17).

4.5 Parameter restrictions and fundamental parameters 

The parameters of the ‘data generating model’, presumed to have generated
the observed data, are called structural parameters. Let $\boldsymbol{\theta}$
be the vector of all structural parameters including the regression coefficients
$\boldsymbol{\beta}$, the factor loadings $\boldsymbol{\lambda}^{(l)}_m$,
$m = 1,\ldots,M_l$, $l = 1,\ldots,L$, the nonduplicated elements of the
covariance matrices $\boldsymbol{\Psi}^{(l)}$, $l = 1,\ldots,L$, the parameters
$\boldsymbol{\iota}$ for modelling level-1 heteroscedasticity, the threshold
parameters $\boldsymbol{\kappa}$ and the class membership parameters
$\boldsymbol{\varrho}$. Note that the structural parameter vector
$\boldsymbol{\theta}$ should not be confused with the residual covariance matrix
$\boldsymbol{\Theta}$ with elements $\theta_{ii'}$.

More or less complex restrictions, such as the sign of a parameter or equality
between parameters, are often required. These restrictions can be imposed via
reparameterization in terms of so called fundamental parameters
$\boldsymbol{\vartheta}$ which are unrestricted. The resulting reparameterized
model is equivalent (in the sense of Chapter 5) to the original structural model
with parameter restrictions. The main idea of this approach is to solve the
implicit functions among structural parameters for a subset of fundamental
parameters. An important merit of this approach, apart from its generality, is
that unconstrained optimization procedures can be used in the estimation phase.
This avoids the more complex estimation approach with restrictions imposed using
for instance Lagrange multipliers (e.g. Fletcher, 1987).

Each structural parameter $\theta_k$ is specified as a known one time
differentiable function of the fundamental parameters

$$\theta_k = h_k(\boldsymbol{\vartheta}).$$

Observe that $h_k(\boldsymbol{\vartheta}) = \vartheta_l$ for some $l$ whenever
$\theta_k$ is unconstrained, which is usually the case for most structural
parameters. The full set of restrictions can be written in terms of a vector
function

$$\boldsymbol{\theta} = \mathbf{h}(\boldsymbol{\vartheta}).$$

The following lists different types of restrictions and their implementation
via reparameterization of the structural parameters $\boldsymbol{\theta}$ in
terms of fundamental parameters $\boldsymbol{\vartheta}$. Let there be $K$
structural parameters $\theta_k$, $k = 1,\ldots,K$ and let $a_k$, $b_k$ and $a$
be specified constants, including zero.

1. Identity restrictions are perhaps the most common. For instance in growth
curve modeling, the residual variances are often constrained equal across
occasions, corresponding to the assumption of homoscedasticity. The
restrictions are of the form

$$\theta_k = \theta_l, \qquad k \neq l,$$

and are implemented as

$$\theta_k = \vartheta_k, \qquad \theta_l = \vartheta_k.$$


2. Linear restrictions are of the form

$$\sum_{k=1}^{K} a_k \theta_k = a.$$

Such restrictions are often useful for simplifying the model structure. For
example, in a threshold model for ordinal responses, linear restrictions can
be used to force the thresholds to be equally spaced. Linear restrictions are
implemented as

$$\theta_k = \vartheta_k, \quad k = 1,\ldots,K-1, \qquad \text{and} \qquad \theta_K = \frac{a - \sum_{k=1}^{K-1} a_k \vartheta_k}{a_K}.$$


3. Inequality restrictions of the form

$$\theta_k \ge a_k \qquad \text{and} \qquad \theta_k > a_k$$

are frequently required, a typical example being that a variance is positive.
These restrictions are implemented as

$$\theta_k = a_k + (\vartheta_k)^2$$

and

$$\theta_k = a_k + \exp(\vartheta_k).$$

Note that in situations where the unconstrained parameter would be estimated
as less than or equal to $a_k$, so that the constrained parameter must be $a_k$
to maximize the likelihood, the second parametrization can cause difficulties
with convergence. This is because $\vartheta_k$ will take on very large
negative values and the likelihood will appear flat with respect to
$\vartheta_k$.

Another type of inequality restriction are order restrictions of the form

$$\theta_1 \le \theta_2 \le \cdots \le \theta_K \qquad \text{and} \qquad \theta_1 < \theta_2 < \cdots < \theta_K,$$

which can be implemented as

$$\theta_1 = \vartheta_1, \qquad \theta_k = \theta_{k-1} + (\vartheta_k)^2, \quad k > 1,$$

and

$$\theta_1 = \vartheta_1, \qquad \theta_k = \theta_{k-1} + \exp(\vartheta_k), \quad k > 1.$$

Such restrictions are required for example for the stereotype model described
in Section 2.3.4 where $\alpha^{1} < \cdots < \alpha^{S-1}$ and for the
cumulative models in the same section where
$\kappa_{i1} < \cdots < \kappa_{i,S-1}$. Note that estimates for these models
usually obey these restrictions even when they are not explicitly enforced.

4. Domain restrictions of the form

$$a_k < \theta_k < b_k$$

can be implemented as

$$\theta_k = \frac{a_k + b_k \exp(\vartheta_k)}{1 + \exp(\vartheta_k)}.$$

For example, probabilities must lie in the range [0, 1], giving the familiar
multinomial logit transformation, useful for example for latent class
probabilities. Correlations can be restricted to lie in the permitted range by
setting $a_k = -1$ and $b_k = 1$.

5. Nonlinear restrictions are often of the form 

K L 

a k jj = a * 

k=l 1=1 

A simple example is = which can simply be imposed as 

Oi = ^-^2 where dk = ^k ： fc = 2,3,4. 

An application of this would be the restriction that two reliabilities (ratio 
of true score to total variance) are equal. This is illustrated in the life 
satisfaction example in Section 10.4; see also Table 10.13. 

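Nonlinear inequality restrictions are implied by the requirement that a matrix
$\mathbf{M}$ is positive semi-definite (i.e.
$\mathbf{a}'\mathbf{M}\mathbf{a} \ge 0$ for all $\mathbf{a}$), the classical
example being a covariance matrix $\mathbf{M}$. We can use the Cholesky
decomposition $\mathbf{L}$ of $\mathbf{M}$,

$$\mathbf{M} = \mathbf{L}\mathbf{L}',$$

to impose the restriction where $\boldsymbol{\theta}$ are the nonduplicated
elements of $\mathbf{M}$ and $\boldsymbol{\vartheta}$ are the elements on and
below the diagonal of $\mathbf{L}$.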
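Several of the reparameterizations listed above can be sketched as small
functions mapping unrestricted fundamental parameters to restricted structural
parameters. This is an illustrative sketch, not the implementation used by any
particular software; all function names and numerical values are assumptions.

```python
import math


def at_least(free, a):
    """theta >= a via theta = a + free**2."""
    return a + free ** 2


def greater_than(free, a):
    """theta > a via theta = a + exp(free)."""
    return a + math.exp(free)


def strictly_ordered(free):
    """theta_1 < ... < theta_K via theta_k = theta_{k-1} + exp(free_k)."""
    values = [free[0]]
    for value in free[1:]:
        values.append(values[-1] + math.exp(value))
    return values


def within(free, a, b):
    """a < theta < b via the scaled logistic transformation."""
    return (a + b * math.exp(free)) / (1.0 + math.exp(free))


def linear_restriction(free, coefficients, total):
    """Solve the last parameter from sum_k coefficients[k] * theta[k] = total."""
    partial = sum(c * t for c, t in zip(coefficients[:-1], free))
    return list(free) + [(total - partial) / coefficients[-1]]


def covariance_from_cholesky(free, dim):
    """Build M = L L' from the row-wise lower triangle of L."""
    L = [[0.0] * dim for _ in range(dim)]
    position = 0
    for i in range(dim):
        for j in range(i + 1):
            L[i][j] = free[position]
            position += 1
    return [[sum(L[i][k] * L[j][k] for k in range(dim)) for j in range(dim)]
            for i in range(dim)]


variance = greater_than(-2.0, 0.0)
thresholds = strictly_ordered([-1.0, 0.0, -3.0])
correlation = within(0.8, -1.0, 1.0)
constrained = linear_restriction([1.0, 2.0], [1.0, 1.0, 2.0], 5.0)
covariance = covariance_from_cholesky([1.0, 0.5, -2.0], 2)
```

Each function accepts any real-valued input, so an unconstrained optimizer can
search over the fundamental parameters while the structural parameters stay
inside their permitted region. The `within` function as written would overflow
for very large positive inputs; a production version would need a numerically
stable form.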
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variables 


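4.6.1 Reduced form for $\boldsymbol{\eta}_j$

Remember that $\boldsymbol{\eta}_j$ is shorthand notation for
$\boldsymbol{\eta}_{jk\ldots z}$, comprising all latent variables for the $j$th
level-2 unit. Assuming that $(\mathbf{I} - \mathbf{B})$ is invertible, the
structural model in (4.13) can be solved for $\boldsymbol{\eta}_j$, giving the
reduced form

$$\boldsymbol{\eta}_j = (\mathbf{I} - \mathbf{B})^{-1}(\boldsymbol{\Gamma}\mathbf{w}_j + \boldsymbol{\zeta}_j) \equiv \boldsymbol{\Pi}_1 \mathbf{w}_j + \boldsymbol{\Pi}_2 \boldsymbol{\zeta}_j, \tag{4.18}$$

where $\boldsymbol{\Pi}_1 = (\mathbf{I} - \mathbf{B})^{-1}\boldsymbol{\Gamma}$
and $\boldsymbol{\Pi}_2 = (\mathbf{I} - \mathbf{B})^{-1}$.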



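4.6.2 Reduced form for $\boldsymbol{\eta}^{(l)}$

We will now consider the reduced form models for the latent variables at each
level of the model. Recall from (4.14) that $\mathbf{B}$ and
$\boldsymbol{\Gamma}$ are upper block diagonal since latent variables varying at
a given level $l$ can only be regressed on latent or observed variables varying
at the same or higher level. It follows that $\boldsymbol{\Pi}_1$ and
$\boldsymbol{\Pi}_2$ are also upper block diagonal,

$$
\boldsymbol{\Pi}_1 =
\begin{bmatrix}
\boldsymbol{\Pi}_1^{(22)} & \boldsymbol{\Pi}_1^{(23)} & \cdots & \boldsymbol{\Pi}_1^{(2L)} \\
\mathbf{0} & \boldsymbol{\Pi}_1^{(33)} & \cdots & \boldsymbol{\Pi}_1^{(3L)} \\
\vdots & & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Pi}_1^{(LL)}
\end{bmatrix},
\tag{4.19}
$$

and similarly for $\boldsymbol{\Pi}_2$.

Reduced form models for the latent variables at each of the levels
$l = 2,\ldots,L$ can then be written as

$$\boldsymbol{\eta}^{(l)} = \boldsymbol{\Pi}_1^{(l+)} \mathbf{w}^{(l+)} + \boldsymbol{\Pi}_2^{(l+)} \boldsymbol{\zeta}^{(l+)}. \tag{4.20}$$

Here, $\mathbf{w}^{(l+)} = (\mathbf{w}^{(l)\prime}, \ldots, \mathbf{w}^{(L)\prime})'$
is a subvector of $\mathbf{w}$ containing all variables varying at level $l$ or
above, and similarly for $\boldsymbol{\zeta}^{(l+)}$.
$\boldsymbol{\Pi}_1^{(l+)}$ is the submatrix of $\boldsymbol{\Pi}_1$ with rows
corresponding to block $l$ and columns corresponding to blocks $l$ to $L$, and
analogously for $\boldsymbol{\Pi}_2^{(l+)}$.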

4.6.3 Reduced form for the linear predictor $\boldsymbol{\nu}_{z(L)}$

We will now derive the reduced form of the vector of linear predictors
$\boldsymbol{\nu}_{z(L)}$ for the $z$th level-$L$ unit. From the ‘two-level
representation’ in (4.6), the vector of linear predictors for a top-level unit
can be written as




Recall that V Z (L) = (”%), ”%)， , ” 黑))’， the vector of all (realizations of 
all) latent variables for the zth top-level unit. Let = (nP) ， ... ， n ^))， 

the same matrix column-appended as many times as there are level-Z units 
in the zth level-L unit. Letting n 2 ^(L) be an upper block-diagonal matrix 
analogous to (4.19) but with blocks n ^ L )， 

Vz(L) = n 22 ： (L)C(L)- 

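The linear predictors can then be expressed as (omitting the $z$ subscript)

$$
\begin{aligned}
\boldsymbol{\nu}_{(L)} &= \mathbf{X}_{(L)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{(L)}\boldsymbol{\Pi}_{1(L)}\mathbf{w}_{(L)} + \boldsymbol{\Lambda}_{(L)}\boldsymbol{\Pi}_{2(L)}\boldsymbol{\zeta}_{(L)} \\
&= \mathbf{X}_{(L)}\boldsymbol{\beta} + \mathbf{A}_{1(L)}\mathbf{w}_{(L)} + \mathbf{A}_{2(L)}\boldsymbol{\zeta}_{(L)},
\end{aligned}
\tag{4.21}
$$

where $\boldsymbol{\beta}$,
$\mathbf{A}_{1(L)} \equiv \boldsymbol{\Lambda}_{(L)}\boldsymbol{\Pi}_{1(L)}$ and
$\mathbf{A}_{2(L)} \equiv \boldsymbol{\Lambda}_{(L)}\boldsymbol{\Pi}_{2(L)}$ are
parameters of the reduced form linear predictor.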

4.7 Moment structure of the latent variables 

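4.7.1 Moment structure of $\boldsymbol{\eta}_j$

From (4.18), the mean structure of $\boldsymbol{\eta}_j$ becomes

$$E(\boldsymbol{\eta}_j \mid \mathbf{w}_j) = \boldsymbol{\Pi}_1 \mathbf{w}_j, \tag{4.22}$$

and the covariance structure becomes

$$\mathrm{Cov}(\boldsymbol{\eta}_j \mid \mathbf{w}_j) = \boldsymbol{\Pi}_2 \mathrm{Cov}(\boldsymbol{\zeta}_j) \boldsymbol{\Pi}_2'. \tag{4.23}$$

The covariance matrix of the disturbances $\mathrm{Cov}(\boldsymbol{\zeta}_j)$
has a block diagonal form with blocks $\boldsymbol{\Psi}^{(l)}$ for level $l$.
Since the disturbances are independent across levels, the covariance matrix in
(4.23) can be written as

$$\mathrm{Cov}(\boldsymbol{\eta}_j \mid \mathbf{w}_j) = \sum_{l=2}^{L} \boldsymbol{\Pi}_2^{(\cdot l)} \boldsymbol{\Psi}^{(l)} \boldsymbol{\Pi}_2^{(\cdot l)\prime}, \tag{4.24}$$

where $\boldsymbol{\Pi}_2^{(\cdot l)}$ are blocks of $\boldsymbol{\Pi}_2$ (the
columns belonging to level $l$) as shown for $\boldsymbol{\Pi}_1$ in (4.19).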


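4.7.2 Moment structure of $\boldsymbol{\eta}^{(l)}$

It follows from (4.20) that the conditional mean structure of the latent
variables $\boldsymbol{\eta}^{(l)}$ at level $l$ given $\mathbf{w}^{(l+)}$
becomes

$$E(\boldsymbol{\eta}^{(l)} \mid \mathbf{w}^{(l+)}) = \boldsymbol{\Pi}_1^{(l+)} \mathbf{w}^{(l+)}.$$

The covariance structure at level $l$ is

$$
\begin{aligned}
\mathrm{Cov}(\boldsymbol{\eta}^{(l)} \mid \mathbf{w}^{(l+)}) &= \boldsymbol{\Pi}_2^{(l+)} \mathrm{Cov}(\boldsymbol{\zeta}^{(l+)}) \boldsymbol{\Pi}_2^{(l+)\prime} \\
&= \sum_{c=l}^{L} \boldsymbol{\Pi}_2^{(lc)} \boldsymbol{\Psi}^{(c)} \boldsymbol{\Pi}_2^{(lc)\prime},
\end{aligned}
\tag{4.25}
$$

and the covariance structure between latent variables at levels $a$ and $b$,
$a < b$, is

$$
\begin{aligned}
\mathrm{Cov}(\boldsymbol{\eta}^{(a)}, \boldsymbol{\eta}^{(b)} \mid \mathbf{w}^{(a+)}) &= \boldsymbol{\Pi}_2^{(ab+)} \mathrm{Cov}(\boldsymbol{\zeta}^{(b+)}) \boldsymbol{\Pi}_2^{(b+)\prime} \\
&= \sum_{c=b}^{L} \boldsymbol{\Pi}_2^{(ac)} \boldsymbol{\Psi}^{(c)} \boldsymbol{\Pi}_2^{(bc)\prime},
\end{aligned}
\tag{4.26}
$$

where the second equalities in (4.25) and (4.26) follow from the block diagonal
form of $\mathrm{Cov}(\boldsymbol{\zeta}^{(l+)})$.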


4-7.3 Example: 3-level model 


It is instructive to consider the special case of a 3-level model. The structural 
model can in this case be expressed as 

裳 ][$.] 會 r s ][: phi ]. 

From (4.20)，the reduced forms for the latent variables are 

必） =nrM^+nr)^) 

=[nf 2 ) nf 3 ) ] [ ] + [ n 2 22) n 2 23) ] [ ] > 

and 

nf : i.n^wf + nf>ci 3) . 

We can find expressions for the parameter matrices by first solving for rj^ 
and then substituting the reduced form of in the structural model for r ] 资 
and solving for rj ^, giving 

nf 2) = ( 1 - 6 (22 ))-^( 22) , 

nf 3) = (i - b^ 22 ))- 1 [b( 23 )(i - b^)-^^ 3 ) + r( 23 )], 

nf 2 ) = (i-b 。 2 ))- 1 , 

nf 3) = (I - - b^)- 1 , 

nf 3) = (i-B (33) ) _1 r (33) , 

r4 33) = (i-b (33) 疒 1 . 

The conditional expectations are 
E(^|wf,wf) = n^wf + n^wf 
=(I-B (22 >)- x x 

{ [b (23) (i- B (33) ”r (33) + r (23) ] wf) + r (22) wg)} 
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and 


E(^ 3 Vi 3) ) = 

=(I —bP 3 ))- 1 ]^ 33 )^^ 3 ). 


The conditional covariance matrices are 


Cov ( 必)| wg ), w ?)) = n ? 2 ) 巾⑶时 印+时 3 ) 巾⑶时 3) ’ 


(I 一 B( 22 ))_i x 


[B( 23 )(I _ B 03 )) -1 屯 ⑶ (I _ B (33) 广 1 ’B (23)/ + 屯 (2) ] x 
(I 一 bP 2 ))- 1 ’ 


Cov^fV^) = nf 3 ) 屯⑶ np)' 

= (工-丑 ㈣广屯⑶卩-:^ 33 ))- 1 '， 

and 

⑶ II 严 

=(I- B 03 )) -1 屯 (3) (I _ B (33) )- 1/ . 


4.8 Marginal moment structure of observed and latent responses 

Recall that conditional on the latent variables, the response model is a generalized
linear model with link function $g(\cdot)$. The regression parameters therefore
represent conditional effects of covariates given the latent variables. In some 
instances the population averaged effects or marginal effects (averaged over 
the latent variable distribution) are of interest. We first consider the mean 
structure of the responses, the expectation of the responses as a function of 
the covariates but marginal with respect to the latent variables. Subsequently, 
we derive the marginal covariance structure. 


4.8.1 Marginal mean structures and population average effects

We now consider the expectation of the response conditional on the covariates,
but marginal with respect to the latent variables. Applying the double
expectation rule,

$$E(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}) = E_{\zeta}\{E(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}, \boldsymbol{\zeta}^{(L)})\} \tag{4.27}$$

$$= \int g^{-1}(\boldsymbol{\nu}^{(L)}) \Big[ \prod_{l} h^{(l)}(\boldsymbol{\zeta}^{(l)}) \Big] \, d\boldsymbol{\zeta}^{(L)},$$

where $h^{(l)}(\cdot)$ is the density of the disturbances at level $l$. This is simply the
'population averaged' response for given covariate values. So-called population
averaged effects or marginal effects (with respect to the latent variables) are
obtained by considering the relationship between this expectation and the
covariates.

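We will consider the general form of the linear predictor in (4.21), written
for a single level-1 unit as

$$\nu = \mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1' \mathbf{w}^{(L)} + \mathbf{a}_2' \boldsymbol{\zeta}^{(L)},$$

where $\mathbf{a}_1'$ is a row of $\mathbf{A}_1^{(L)}$ and similarly for $\mathbf{a}_2'$. For instance, in a two-level
random coefficient model with continuous latent variables,

$$\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\boldsymbol{\zeta}_j^{(2)},$$

$\mathbf{a}_1 = \mathbf{0}$, $\mathbf{a}_2 = \mathbf{z}_{ij}$ and $\boldsymbol{\zeta}^{(L)} = \boldsymbol{\zeta}_j^{(2)}$. A factor model has the same structure except
that $\mathbf{a}_2'$ is a row of the factor loading matrix $\boldsymbol{\Lambda}$.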

For an identity link, $g^{-1}(\nu) = \nu$, the expectation simplifies to

$$E(y \mid \mathbf{x}, \mathbf{w}^{(L)}) = \mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}, \tag{4.28}$$

so that the link function is retained and the marginal effects are equal to the
conditional effects, whatever the latent variable distribution.

For a log link, $g^{-1}(\nu) = \exp(\nu)$, the marginal effects are equal to the
conditional effects, apart from the intercept, regardless of the latent variable
distribution. For normal latent variables we obtain

$$E(y \mid \mathbf{x}, \mathbf{w}^{(L)}) = \exp(\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)} + \mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2/2),$$

which in the random intercept case reduces to

$$E(y_{ij} \mid \mathbf{x}_{ij}, \mathbf{z}_{ij}) = \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \psi/2),$$

so that the marginal regression parameters are equal to the conditional
parameters except for the intercept which increases by $\psi/2$.
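
The intercept shift of $\psi/2$ under a log link can be confirmed by integrating out a normal random intercept numerically. The sketch below is an illustration under assumed parameter values (`beta0`, `beta1`, `psi` are made up for the example); it uses Gauss-Hermite quadrature from NumPy rather than any routine described in the book.

```python
import numpy as np

def marginal_mean_log_link(lin_pred, psi, n_nodes=40):
    """E[exp(lin_pred + zeta)] for zeta ~ N(0, psi), by Gauss-Hermite quadrature."""
    # hermegauss uses the weight function exp(-z^2 / 2).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalise to the N(0, 1) density
    return float(np.sum(weights * np.exp(lin_pred + np.sqrt(psi) * nodes)))

beta0, beta1, psi = 0.2, 0.5, 0.8  # illustrative conditional parameters
x = np.array([0.0, 1.0, 2.0])

numeric = np.array([marginal_mean_log_link(beta0 + beta1 * xi, psi) for xi in x])
closed_form = np.exp(beta0 + psi / 2.0 + beta1 * x)

# On the log scale the marginal slope equals the conditional slope beta1.
marginal_slope = np.log(numeric[1] / numeric[0])
```

The numerically integrated means agree with the closed form, and the slope on the log scale is unchanged; only the intercept moves.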

The subsequent results are confined to normal latent variables. For a probit
link, it is convenient to derive the expectation using the latent response
formulation (see also Section 2.4). The model can be written as

$$y^* = \mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)} + \underbrace{\mathbf{a}_2'\boldsymbol{\zeta}^{(L)} + \epsilon}_{\xi}, \tag{4.29}$$

where $\xi$ is the 'total residual' which is normally distributed with zero mean
and variance $\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1$, and $y = 1$ if $y^* > 0$ and $y = 0$ otherwise. Note
that the mean structure for latent responses is identical to that presented in
(4.28) for the identity link.

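The expectation of the observed response becomes

$$E(y \mid \mathbf{x}, \mathbf{w}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{w}) = \Pr(y^* > 0 \mid \mathbf{x}, \mathbf{w})$$

$$= \Pr(-\xi \le \mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}) = \Pr(\xi \le \mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)})$$

$$= \Pr\left( \frac{\xi}{\sqrt{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1}} \le \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1}} \right) = \Phi\left( \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1}} \right)$$

$$= \Phi(\mathbf{x}'\boldsymbol{\beta}^* + \mathbf{a}_1^{*\prime}\mathbf{w}^{(L)}), \tag{4.30}$$

where $\boldsymbol{\beta}^*$ and $\mathbf{a}_1^*$ are the original parameter vectors divided by the total
residual standard deviation of the latent response. For the probit link, the link
function is thus retained, but the marginal effects are attenuated relative to
the conditional effects. Following the same reasoning, it is evident that an
analogous result holds for the cumulative probit model for ordinal responses.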
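
The attenuation result for the probit link can be illustrated by integrating the conditional probit probability over a normal random intercept and comparing with the closed form. This is a sketch under assumptions: a single random intercept with variance `psi`, illustrative linear predictor values, and Gauss-Hermite quadrature from NumPy.

```python
import math
import numpy as np

def std_normal_cdf(z):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def marginal_probit(lin_pred, psi, n_nodes=60):
    """E[Phi(lin_pred + zeta)] for zeta ~ N(0, psi), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)
    sd = math.sqrt(psi)
    return float(sum(w * std_normal_cdf(lin_pred + sd * z)
                     for z, w in zip(nodes, weights)))

psi = 1.5                                   # illustrative random intercept variance
lin_preds = [-1.0, -0.3, 0.0, 0.8, 2.0]     # illustrative values of x'beta

numeric = [marginal_probit(v, psi) for v in lin_preds]
# Closed form: the linear predictor is divided by the total residual
# standard deviation sqrt(psi + 1).
attenuated = [std_normal_cdf(v / math.sqrt(psi + 1.0)) for v in lin_preds]
```

With `psi = 1.5` the marginal coefficients are the conditional ones divided by $\sqrt{2.5} \approx 1.58$, which is the attenuation described above.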

Sometimes the variance of the latent response $y^*$ in the probit model is set
to one, $\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + \theta = 1$, instead of fixing the variance of the error term $\epsilon$
as above; see for instance Muthén (1984). Note that marginal effects are not
attenuated in this case.

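For a logit link, the model is as in (4.29) but with $\epsilon$ specified as logistic,
yielding

$$\mathrm{Var}(\xi) = \mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + \pi^2/3.$$

This total residual has a compound logistic-normal distribution. Multiplying
the total residual by the factor

$$\sqrt{\frac{\pi^2/3}{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + \pi^2/3}} \approx \frac{1}{\sqrt{1 + 0.30\,\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2}}$$

we obtain a residual with the variance of the logistic distribution. Since the
logistic and normal distributions are very similar,

$$\Pr(y = 1 \mid \mathbf{x}, \mathbf{w}) = \Pr\left( \frac{\xi}{\sqrt{1 + 0.30\,\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2}} \le \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{1 + 0.30\,\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2}} \right) \approx F\left( \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{1 + 0.30\,\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2}} \right),$$

where $F$ is the logistic cumulative distribution function. Note that Zeger et
al. (1988) use $(16\sqrt{3}/(15\pi))^2 \approx 0.35$ in place of 0.30.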
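
To get a feel for how good the logistic approximation is, the exact marginal probability under a random intercept logit model can be computed by quadrature and compared with the two attenuation constants mentioned above. The code is a sketch with assumed values (random intercept variance 1, a small grid of linear predictors); the function names are hypothetical.

```python
import math
import numpy as np

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def exact_marginal_logit(lin_pred, variance, n_nodes=60):
    """E[expit(lin_pred + zeta)] for zeta ~ N(0, variance), by quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)
    sd = math.sqrt(variance)
    return float(sum(w * expit(lin_pred + sd * z) for z, w in zip(nodes, weights)))

def approx_marginal_logit(lin_pred, variance, constant=0.30):
    """Logistic cdf of the linear predictor shrunk by sqrt(1 + constant * variance)."""
    return expit(lin_pred / math.sqrt(1.0 + constant * variance))

variance = 1.0                       # illustrative random intercept variance
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]   # illustrative linear predictor values
zeger_constant = (16.0 * math.sqrt(3.0) / (15.0 * math.pi)) ** 2

exact = [exact_marginal_logit(v, variance) for v in grid]
approx_030 = [approx_marginal_logit(v, variance, 0.30) for v in grid]
approx_zeger = [approx_marginal_logit(v, variance, zeger_constant) for v in grid]
```

Printing the three lists side by side shows that both constants track the exact marginal probabilities closely for a moderate variance, and that the marginal probabilities are pulled towards 0.5 relative to the conditional ones.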

Other link functions are generally not preserved under marginalization. A 
useful discussion of conditional and marginal effects for different models is 
given by Ritz and Spiegelman (2004). 


4.8.2 Marginal covariance structures

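In general, the covariance structure marginal with respect to the latent
variables can be derived using the relation

$$\mathrm{Cov}(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}) = E_{\zeta}[\mathrm{Cov}(\mathbf{y}^{(L)} \mid \boldsymbol{\nu}^{(L)})] + \mathrm{Cov}_{\zeta}[E(\mathbf{y}^{(L)} \mid \boldsymbol{\nu}^{(L)})].$$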

The marginal covariance between different level-1 units $i$ and $i'$ (omitting
higher-level subscripts) becomes

$$\mathrm{Cov}(y_i, y_{i'} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}) = E_{\zeta}[\mathrm{Cov}(y_i, y_{i'} \mid \nu_i, \nu_{i'})] + \mathrm{Cov}_{\zeta}[E(y_i \mid \nu_i), E(y_{i'} \mid \nu_{i'})]$$

$$= \mathrm{Cov}_{\zeta}[g^{-1}(\nu_i), g^{-1}(\nu_{i'})], \tag{4.31}$$

the covariance between the conditional expectations. Note that the first term
above is zero because $\mathrm{Cov}(y_i, y_{i'} \mid \nu_i, \nu_{i'}) = 0$ due to conditional independence
given the latent variables. All pairs of units having the same two values of
the linear predictor will have the same marginal covariances and correlations.
However, with the exception of an identity link, the 'intraclass' correlation
between the observed responses will differ between clusters with different
cluster-specific covariates and between pairs of units within clusters if there
are lower-level covariates (such as time in longitudinal or panel data).

Table 4.2 Nomenclature for bivariate normal latent response correlations

               Continuous     Dichotomous    Ordinal       Censored
Continuous     Pearson
Dichotomous    Biserial       Tetrachoric
Ordinal        Polyserial     Polychoric     Polychoric
Censored       Tobitserial†   Bitobit†       Polytobit†    Tobit

† We may have invented these terms

For continuous responses and latent responses, the marginal covariance
structure is

$$\boldsymbol{\Omega}^{(L)} \equiv \mathrm{Cov}(\mathbf{y}^{*(L)} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}) = \mathbf{A}_2^{(L)} \boldsymbol{\Psi}^{(L)} \mathbf{A}_2^{(L)\prime} + \boldsymbol{\Theta}, \tag{4.32}$$

where $\boldsymbol{\Theta}$ is the typically diagonal covariance matrix of the 'errors' $\boldsymbol{\epsilon}$ (see
Display 4.1). This yields the correlation structure

$$\boldsymbol{\rho}^{(L)} \equiv \mathrm{Cor}(\mathbf{y}^{*(L)} \mid \mathbf{X}^{(L)}, \mathbf{w}^{(L)}) = [\mathrm{Diag}(\boldsymbol{\Omega}^{(L)})]^{-1/2}\, \boldsymbol{\Omega}^{(L)}\, [\mathrm{Diag}(\boldsymbol{\Omega}^{(L)})]^{-1/2}. \tag{4.33}$$

Consider the special case of multinormal latent responses. The possible
combinations of response types and corresponding names given to the latent
response correlations are presented in Table 4.2. Note that historically these
names have referred to bivariate correlations without conditioning on explanatory
variables but will here be used generally.

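For a Poisson response with a log link, the marginal variance becomes

$$\mathrm{Var}(y \mid \mathbf{x}, \mathbf{w}^{(L)}) = E_{\zeta}[\mathrm{Var}(y \mid \mathbf{x}, \mathbf{w}^{(L)}, \boldsymbol{\zeta}^{(L)})] + \mathrm{Var}_{\zeta}[E(y \mid \mathbf{x}, \mathbf{w}^{(L)}, \boldsymbol{\zeta}^{(L)})]$$

$$= E_{\zeta}[\exp(\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)} + \mathbf{a}_2'\boldsymbol{\zeta}^{(L)})] + \mathrm{Var}_{\zeta}[\exp(\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)} + \mathbf{a}_2'\boldsymbol{\zeta}^{(L)})]$$

$$= E(y \mid \mathbf{x}, \mathbf{w}^{(L)}) + \exp(\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)})^2\, \mathrm{Var}[\exp(\mathbf{a}_2'\boldsymbol{\zeta}^{(L)})].$$

For normal $\boldsymbol{\zeta}$, it follows from a general result for the variance of log-normal
random variables (e.g. Johnson et al., 1994, p. 212) that

$$\mathrm{Var}(\exp(\mathbf{a}_2'\boldsymbol{\zeta}^{(L)})) = \exp(\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2)[\exp(\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2) - 1],$$

so that

$$\mathrm{Var}(y \mid \mathbf{x}, \mathbf{w}^{(L)}) = E(y \mid \mathbf{x}, \mathbf{w}^{(L)}) \{1 + E(y \mid \mathbf{x}, \mathbf{w}^{(L)})[\exp(\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2) - 1]\}.$$

Without latent variables, the marginal variance reduces to $E(y \mid \mathbf{x}, \mathbf{w}^{(L)})$, the
variance function of the Poisson distribution. The latent variables therefore
lead to an increase in the variance that is proportional to the expectation
squared. In contrast, the usual quasi-likelihood approach sets the variance
function equal to $\phi E(y \mid \mathbf{x}, \mathbf{w}^{(L)})$ (see Section 2.3.1), thus assuming that the
variance is proportional to the expectation.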
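
The overdispersion formula for the Poisson model can be verified by computing the marginal mean and variance from the law of total variance, integrating over a normal random intercept. This is an illustrative sketch: the linear predictor value, the variance `psi` and the helper name `normal_expectation` are assumptions made for the example.

```python
import math
import numpy as np

def normal_expectation(func, psi, n_nodes=60):
    """E[func(zeta)] for zeta ~ N(0, psi), by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / math.sqrt(2.0 * math.pi)
    sd = math.sqrt(psi)
    return float(sum(w * func(sd * z) for z, w in zip(nodes, weights)))

lin_pred, psi = 0.7, 0.5   # illustrative fixed part and random intercept variance

# Conditional on zeta the response is Poisson with mean (= variance) mu(zeta).
mu = lambda zeta: math.exp(lin_pred + zeta)

marginal_mean = normal_expectation(mu, psi)
second_moment_of_mu = normal_expectation(lambda z: mu(z) ** 2, psi)

# Law of total variance: E[Var(y | zeta)] + Var[E(y | zeta)].
marginal_var = marginal_mean + (second_moment_of_mu - marginal_mean ** 2)

# Closed form: mean * {1 + mean * [exp(psi) - 1]}.
closed_form_var = marginal_mean * (1.0 + marginal_mean * (math.exp(psi) - 1.0))
```

The excess of `marginal_var` over `marginal_mean` is the quadratic term, which is what distinguishes this variance function from the quasi-likelihood form with a constant dispersion factor.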

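For dichotomous responses, the marginal variance is

$$\mathrm{Var}(y \mid \mathbf{x}, \mathbf{w}^{(L)}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{w}^{(L)})[1 - \Pr(y = 1 \mid \mathbf{x}, \mathbf{w}^{(L)})],$$

which for a probit link has the simple form

$$\Phi\left( \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1}} \right) \left[ 1 - \Phi\left( \frac{\mathbf{x}'\boldsymbol{\beta} + \mathbf{a}_1'\mathbf{w}^{(L)}}{\sqrt{\mathbf{a}_2'\boldsymbol{\Psi}^{(L)}\mathbf{a}_2 + 1}} \right) \right].$$

Note that the relationship between mean and variance is the same as for
Bernoulli models without latent variables. This is as expected since
overdispersion is not possible for dichotomous responses.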

4.9 Reduced form distribution and likelihood 

The conditional distribution of the observed responses y given the explanatory 
variables X is called the reduced form distribution. There are two ways of 
deriving the reduced form, via latent variable integration or latent response 
integration. The first approach is based on the specification of conditional 
independence of the observed responses given the latent variables. The second 
rests on the specification of multivariate normality of the latent responses 
marginal with respect to the latent variables. 

4.9.1 Latent variable integration

The reduced form distribution is the distribution of the responses marginal to 
the latent variables but conditional on the explanatory variables. If the latent 
variables are discrete, this is obtained by summing the joint probabilities of 
the responses and class membership over the classes giving a finite mixture 
(see Section 3.4). If the latent variables are continuous, the latent variables are 
integrated out giving an infinite mixture. Thus the latent variable distribution 
is often referred to as the mixing distribution. When integrating over the latent
variables at the different levels, we will exploit conditional independence
among the units at a given level given the latent variables at all higher levels. 

Two-level random intercept model 

For simplicity, consider the two-level random intercept model

$$\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j^{(2)},$$

and let the conditional probability (or probability density) of the response $y_{ij}$
be denoted $g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \boldsymbol{\vartheta})$, where $\boldsymbol{\vartheta}$ is the vector of fundamental
parameters. The marginal distribution of the responses $\mathbf{y}_j^{(2)}$ for the $j$th level-2 unit
given the matrix of covariates $\mathbf{X}_j^{(2)}$ for that unit is

$$g^{(2)}(\mathbf{y}_j^{(2)} \mid \mathbf{X}_j^{(2)}; \boldsymbol{\vartheta}) = \int h(\zeta_j^{(2)}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \boldsymbol{\vartheta}) \, d\zeta_j^{(2)},$$

where $h(\cdot)$ is the density of the random intercept and the product is over all
level-1 units $i$ within the $j$th level-2 unit. This product represents the joint
probability (density) of the responses given the random intercept since the
responses are conditionally independent given the random intercept.
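
As an illustration of latent variable integration for a single cluster, the sketch below evaluates the marginal probability of a vector of dichotomous responses under a random intercept model. The logit link, the normal mixing distribution, the function name `cluster_likelihood` and the parameter values are assumptions chosen for the example; the integral is approximated by Gauss-Hermite quadrature from NumPy.

```python
import itertools
import numpy as np

def cluster_likelihood(y, X, beta, psi, n_nodes=30):
    """Marginal probability of the responses y of one level-2 unit.

    Assumes Bernoulli responses with a logit link and a N(0, psi) random
    intercept; the integral over the intercept is replaced by a weighted sum.
    """
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)           # N(0, 1) weights
    zeta = np.sqrt(psi) * nodes                         # quadrature locations
    nu = (X @ beta)[:, None] + zeta[None, :]            # units x nodes
    p = 1.0 / (1.0 + np.exp(-nu))
    cond = np.where(y[:, None] == 1, p, 1.0 - p)        # g^(1) for each unit
    joint_given_zeta = np.prod(cond, axis=0)            # conditional independence
    return float(np.sum(weights * joint_given_zeta))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])     # intercept and one covariate
beta = np.array([-0.5, 0.4])                           # illustrative coefficients
psi = 1.2                                              # illustrative variance

# The marginal probabilities of all 2^3 response patterns must sum to one.
patterns = [np.array(p) for p in itertools.product([0, 1], repeat=3)]
total = sum(cluster_likelihood(p, X, beta, psi) for p in patterns)

# With psi = 0 the responses are independent and the product form is recovered.
independent = cluster_likelihood(np.array([1, 0, 1]), X, beta, 0.0)
```

The product over level-1 units is taken inside the integral, at each quadrature location, which is exactly where conditional independence is used.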

General model 

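We will use the notation of the GRC formulation, denoting the conditional
distribution of a level-1 unit $i$ as $g^{(1)}(y^{(1)} \mid \mathbf{X}^{(1)}, \boldsymbol{\zeta}^{(2+)}; \boldsymbol{\vartheta})$, where $\boldsymbol{\zeta}^{(l+)}$ is the
vector of disturbances at levels $l$ and above. The multivariate distribution of
the latent variables at level $l$ will be denoted $h^{(l)}(\boldsymbol{\zeta}^{(l)})$. The conditional
distribution of the responses of a level-$l$ unit, conditional on the latent variables
at levels $l+1$ and above, is a function of the distributions of the level-$(l-1)$
units within the unit:

$$g^{(l)}(\mathbf{y}^{(l)} \mid \mathbf{X}^{(l)}, \boldsymbol{\zeta}^{([l+1]+)}; \boldsymbol{\vartheta}) = \int h^{(l)}(\boldsymbol{\zeta}^{(l)}) \prod g^{(l-1)}(\mathbf{y}^{(l-1)} \mid \mathbf{X}^{(l-1)}, \boldsymbol{\zeta}^{(l+)}; \boldsymbol{\vartheta}) \, d\boldsymbol{\zeta}^{(l)}.$$

This recursive relationship can be used to build up the likelihood, increasing
$l$ from 2 to $L-1$. The reduced form distribution of the responses of a level-$L$
unit then is

$$g^{(L)}(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}; \boldsymbol{\vartheta}) = \int h^{(L)}(\boldsymbol{\zeta}^{(L)}) \prod g^{(L-1)}(\mathbf{y}^{(L-1)} \mid \mathbf{X}^{(L-1)}, \boldsymbol{\zeta}^{(L)}; \boldsymbol{\vartheta}) \, d\boldsymbol{\zeta}^{(L)},$$

and the reduced form distribution of all responses is the product

$$g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta}) = \prod g^{(L)}(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}; \boldsymbol{\vartheta}). \tag{4.34}$$

Latent variable integration using Gauss-Hermite quadrature and other methods
is discussed in Section 6.3.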


4.9.2 Latent response integration

We will assume that the latent responses (marginal w.r.t. the latent disturbances
$\boldsymbol{\zeta}$) have a multivariate normal distribution. It follows that univariate,
bivariate etc. latent response distributions (marginal w.r.t. other latent
responses $y^*$) are also normal.

Consider now the $i$th latent response $y_i^*$ underlying the observed response
$y_i$. $y_i$ could for instance represent the response on the $i$th item (for a subject)
in a measurement model or the response of the $i$th unit (in a cluster) in a
random effects model. We can write the latent response model as (omitting
higher-level subscripts)

$$y_i^* = \mu_i + \xi_i,$$

where the mean $\mu_i$ is given in (4.28), the covariance matrix $\boldsymbol{\Omega}^{(L)}$ of the vector
of total residuals $\boldsymbol{\xi}^{(L)}$ in (4.32) and the corresponding correlation matrix $\boldsymbol{\rho}^{(L)}$
in (4.33).

Univariate observed response distribution 

For an ordinal or dichotomous response the marginal probabilities of the
response categories given the explanatory variables can be expressed as

$$\Pr(y_i = a_s) = \int_{\tau_{i,s-1}}^{\tau_{is}} \phi\left( z - \frac{\mu_i}{\sqrt{\omega_{ii}}} \right) dz, \tag{4.35}$$

where $\phi$ is the standard normal density and $\omega_{ii}$ is the $i$th diagonal element of
$\boldsymbol{\Omega}^{(L)}$. The integration limits are just the reduced form thresholds

$$\tau_{is} = \frac{\kappa_s}{\sqrt{\omega_{ii}}}.$$

For a left-censored continuous response the probability is as above with $\tau_{i,s-1} =
-\infty$ and $\kappa_s$ equal to the censoring limit. Here we have kept the mean structure
separate from the threshold structure; the mean structure is taken to be
the part of the linear predictor that is constant across the response categories
(e.g. Muthén, 1984). Another possibility would be to write the integral as

$$\Pr(y_i = a_s) = \int_{\tau^*_{i,s-1}}^{\tau^*_{is}} \phi(z) \, dz = \Phi(\tau^*_{is}) - \Phi(\tau^*_{i,s-1}),$$

where

$$\tau^*_{is} = \frac{\kappa_s - \mu_i}{\sqrt{\omega_{ii}}}.$$
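
The two ways of writing the category probabilities can be compared numerically. In the sketch below the thresholds `kappa`, the mean `mu` and the latent response variance `omega` are illustrative assumptions. One function uses thresholds standardized by subtracting the mean; the other integrates the normal density of the scaled latent response $y^*/\sqrt{\omega}$ between the reduced form thresholds $\kappa/\sqrt{\omega}$ with a simple midpoint rule.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probs(mu, omega, kappa):
    """Category probabilities from standardized thresholds (kappa - mu) / sqrt(omega)."""
    sd = math.sqrt(omega)
    limits = [-math.inf] + [(k - mu) / sd for k in kappa] + [math.inf]
    cdf = [std_normal_cdf(t) for t in limits]
    return [cdf[s + 1] - cdf[s] for s in range(len(cdf) - 1)]

def prob_by_integration(mu, omega, lower, upper, n=20000):
    """Midpoint-rule integral of phi(z - mu/sd) between lower/sd and upper/sd."""
    sd = math.sqrt(omega)
    a, b = lower / sd, upper / sd        # reduced form thresholds
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        z = a + (k + 0.5) * h
        total += math.exp(-0.5 * (z - mu / sd) ** 2) / math.sqrt(2.0 * math.pi)
    return total * h

mu, omega = 0.3, 1.7          # illustrative mean and variance of y*
kappa = [-0.5, 1.0]           # illustrative thresholds, giving three categories

probs = category_probs(mu, omega, kappa)
middle = prob_by_integration(mu, omega, kappa[0], kappa[1])
```

Keeping the mean separate from the thresholds, or absorbing it into them, gives the same probabilities; the choice only affects which quantities are treated as reduced form parameters.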


Bivariate observed response distribution 


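For two ordinal or dichotomous responses $y_i$ and $y_{i'}$, having $S$ and $T$ categories
respectively, the joint response probabilities are

$$\Pr(y_i = a_s, y_{i'} = b_t) = \int_{\tau_{i,s-1}}^{\tau_{is}} \int_{\tau_{i',t-1}}^{\tau_{i't}} \phi_2\left( z_i - \frac{\mu_i}{\sqrt{\omega_{ii}}},\; z_{i'} - \frac{\mu_{i'}}{\sqrt{\omega_{i'i'}}};\; \rho_{ii'} \right) dz_{i'} \, dz_i,$$

where $\phi_2(\cdot, \cdot; \rho_{ii'})$ is the bivariate standard normal density with correlation $\rho_{ii'}$
between latent responses $y_i^*$ and $y_{i'}^*$.

For censored responses, the response distribution for a left censored response
(at $\kappa_{is}$) and an observed continuous response is

$$\Pr(y_i^* \le \kappa_{is},\, y_{i'}^* = y_{i'}) = \frac{1}{\sqrt{\omega_{i'i'}}} \int_{-\infty}^{\tau_{is}} \phi_2\left( z_i - \frac{\mu_i}{\sqrt{\omega_{ii}}},\; \frac{y_{i'} - \mu_{i'}}{\sqrt{\omega_{i'i'}}};\; \rho_{ii'} \right) dz_i.$$

The bivariate (and trivariate etc.) distributions of all combinations between
dichotomous, ordinal, classified and censored responses involve similar
integrals.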






Univariate and bivariate response distributions form the basis for the limited
information estimation method to be discussed in Section 6.7. Integrating over
all the latent responses for a particular level-$L$ unit gives the reduced form
distribution for that unit, $g^{(L)}(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}; \boldsymbol{\vartheta})$. A popular method for performing
high-dimensional latent variable integration is by simulation; see Section 6.3.4.


4.9.3 The likelihood

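The marginal likelihood $l(\boldsymbol{\vartheta}; \mathbf{y} \mid \mathbf{X})$ is proportional to the reduced form
distribution of all units and is considered as a function of the parameters for given
values of the responses,

$$l(\boldsymbol{\vartheta}; \mathbf{y} \mid \mathbf{X}) \propto g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta}) = \prod g^{(L)}(\mathbf{y}^{(L)} \mid \mathbf{X}^{(L)}; \boldsymbol{\vartheta}).$$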


4.10 Reduced form parameters 


The parameters of the reduced form for the latent variables and the reduced
form for the linear predictor could be referred to as reduced form parameters.
However, this term is usually reserved for parameters with the following
properties:

1. the reduced form parameters are functions of the fundamental parameters

2. the reduced form parameters completely characterize the reduced form
distribution

3. the reduced form distribution depends on the fundamental parameters only
through the reduced form parameters.

In the special case of multivariate normal latent variables and conditionally
normal observed responses, the reduced form distribution for a top-level
unit is multivariate normal. The distribution is in this case completely
characterized by the first and second order moments (the mean structure and
covariance structure). It follows that in the above list of properties, 'reduced
form distribution' can be replaced by 'first and second order moments'. In
Section 4.9.2 we showed that the reduced form distribution for probit models
with multivariate normal latent variables is completely characterized by the
mean and threshold structure and tetrachoric correlations. These quantities
can therefore replace 'reduced form distribution' in the above list of properties
in this case.

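To illustrate these ideas for the case of conditionally normal observed
responses, we will now derive the reduced form parameters for a one-factor
model with four items:

$$\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \end{bmatrix} \eta + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \end{bmatrix},$$

where $\eta \sim \mathrm{N}(0, \psi)$, $\epsilon_i \sim \mathrm{N}(0, \theta_{ii})$, and $\mathrm{Cov}(\epsilon_i, \epsilon_{i'}) = 0$. (No intercepts are
specified because the responses have been mean-centered.) All in all there are
9 unknown parameters placed in the column vector $\boldsymbol{\vartheta}$,

$$\boldsymbol{\vartheta} = [\lambda_1, \lambda_2, \lambda_3, \lambda_4, \psi, \theta_{11}, \theta_{22}, \theta_{33}, \theta_{44}]'. \tag{4.36}$$

From the model implication $E(y_i) = 0$ and the specification of normality it
follows that all information is contained in the second order moments of the
$y_i$:

$$\boldsymbol{\Omega} \equiv \mathrm{Cov}(\mathbf{y}) = \begin{bmatrix} \lambda_1^2\psi + \theta_{11} & & & \\ \lambda_2\psi\lambda_1 & \lambda_2^2\psi + \theta_{22} & & \\ \lambda_3\psi\lambda_1 & \lambda_3\psi\lambda_2 & \lambda_3^2\psi + \theta_{33} & \\ \lambda_4\psi\lambda_1 & \lambda_4\psi\lambda_2 & \lambda_4\psi\lambda_3 & \lambda_4^2\psi + \theta_{44} \end{bmatrix}.$$

There are in total 10 nonredundant variances and covariances which are placed
in the vector of reduced form parameters $\mathbf{m}(\boldsymbol{\vartheta})$, given as

$$\mathbf{m}(\boldsymbol{\vartheta}) = \mathrm{vech}(\boldsymbol{\Omega}(\boldsymbol{\vartheta})) = [\lambda_1^2\psi + \theta_{11},\; \lambda_2\psi\lambda_1,\; \lambda_3\psi\lambda_1,\; \lambda_4\psi\lambda_1,\; \lambda_2^2\psi + \theta_{22},\; \lambda_3\psi\lambda_2,\; \lambda_4\psi\lambda_2,\; \lambda_3^2\psi + \theta_{33},\; \lambda_4\psi\lambda_3,\; \lambda_4^2\psi + \theta_{44}]'. \tag{4.37}$$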
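
The mapping from the nine fundamental parameters of the one-factor model to its ten reduced form parameters is easy to write down in code. The following sketch uses made-up parameter values and a hypothetical helper `implied_moments`; it builds the implied covariance matrix and stacks its lower triangle column by column, which is the vech operator.

```python
import numpy as np

def implied_moments(lam, psi, theta):
    """Return the implied covariance matrix and its vech for a one-factor model."""
    lam = np.asarray(lam, dtype=float)
    omega = psi * np.outer(lam, lam) + np.diag(theta)
    # vech: stack the columns of the lower triangle, diagonal included.
    m = np.concatenate([omega[j:, j] for j in range(len(lam))])
    return omega, m

lam = [1.0, 0.8, 1.3, 0.6]        # illustrative factor loadings
psi = 2.0                          # illustrative factor variance
theta = [0.5, 0.7, 0.4, 0.9]       # illustrative unique variances

omega, m = implied_moments(lam, psi, theta)
```

Nine fundamental parameters map into ten reduced form parameters, so the counting condition for identification is met, although counting alone does not settle the question.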


In general, the situation may be more complicated since elements of the
covariance matrix may depend on covariates and be written as polynomials
in these covariates. The corresponding reduced form parameters in this case
are the set of unique coefficients (up to a multiplicative constant) of these
polynomials. For example, consider a random coefficient model with a single
covariate $x_{ij}$,

$$y_{ij} = \beta_1 + \beta_2 x_{ij} + \eta_{1j} + \eta_{2j} x_{ij} + \epsilon_{ij},$$

having a random slope $\eta_{2j}$ which is regressed on a random intercept $\eta_{1j}$,

$$\eta_{2j} = b_{21}\eta_{1j} + \zeta_{2j},$$

$$\eta_{1j} = \zeta_{1j},$$

where $[\zeta_{1j}, \zeta_{2j}]' \sim \mathrm{N}_2(\mathbf{0}, \mathrm{diag}(\psi_{11}, \psi_{22}))$ and $\epsilon_{ij} \sim \mathrm{N}(0, \theta)$. The conditional
variances of the responses become

$$\mathrm{Var}[y_{ij} \mid x_{ij}] = \psi_{11} + 2 b_{21}\psi_{11} x_{ij} + (b_{21}^2\psi_{11} + \psi_{22}) x_{ij}^2 + \theta,$$

and the covariances

$$\mathrm{Cov}[y_{ij}, y_{i'j} \mid x_{ij}, x_{i'j}] = \psi_{11} + b_{21}\psi_{11}(x_{ij} + x_{i'j}) + (b_{21}^2\psi_{11} + \psi_{22}) x_{ij} x_{i'j}.$$

If $i = 1, 2$, the reduced form parameters are $\psi_{11} + \theta$, $b_{21}\psi_{11}$, $b_{21}^2\psi_{11} + \psi_{22}$
(the reduced form parameters of the variance for $i = 1$), $\psi_{11} + \theta$, $b_{21}\psi_{11}$,
$b_{21}^2\psi_{11} + \psi_{22}$ (the reduced form parameters of the variance for $i = 2$),
$\psi_{11}$, $b_{21}\psi_{11}$, $b_{21}^2\psi_{11} + \psi_{22}$ (the reduced form parameters of the covariance between
$i = 1$ and $i = 2$), $\beta_1$, $\beta_2$ (the reduced form parameters of the mean for $i = 1$)
and $\beta_1$, $\beta_2$ (the reduced form parameters of the mean for $i = 2$). The 6
nonredundant reduced form parameters can then be assembled in

$$\mathbf{m}(\boldsymbol{\vartheta}) = [\beta_1, \beta_2, \psi_{11} + \theta, \psi_{11}, b_{21}\psi_{11}, b_{21}^2\psi_{11} + \psi_{22}]'.$$
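
The polynomial form of the variances and covariances in the random coefficient example can be checked against a direct matrix computation. The parameter values and covariate values below are illustrative assumptions; the code solves the structural model for the latent variables and then forms the implied covariance matrix of two responses in a cluster.

```python
import numpy as np

b21, psi11, psi22, theta = 0.7, 1.2, 0.4, 0.9   # illustrative parameters
x = np.array([0.5, -1.5])                        # covariate values for i = 1, 2

# Structural model eta = B eta + zeta with eta = (eta_1, eta_2)'.
B = np.array([[0.0, 0.0], [b21, 0.0]])
Psi = np.diag([psi11, psi22])
inv = np.linalg.inv(np.eye(2) - B)
cov_eta = inv @ Psi @ inv.T

# Response model: each row of Z is (1, x_i), multiplying (eta_1, eta_2)'.
Z = np.column_stack([np.ones_like(x), x])
cov_y = Z @ cov_eta @ Z.T + theta * np.eye(2)

# Polynomial expressions in the covariate.
var_poly = psi11 + 2 * b21 * psi11 * x + (b21 ** 2 * psi11 + psi22) * x ** 2 + theta
cov_poly = psi11 + b21 * psi11 * (x[0] + x[1]) + (b21 ** 2 * psi11 + psi22) * x[0] * x[1]

# The six nonredundant reduced form parameters (beta_1, beta_2 taken as 0 here).
m = np.array([0.0, 0.0, psi11 + theta, psi11, b21 * psi11, b21 ** 2 * psi11 + psi22])
```

The coefficients of the polynomials in `var_poly` and `cov_poly` are exactly the last four entries of `m`, up to the factor 2 on the linear term of the variance.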


4.11 Summary and further reading 

We have introduced a general model framework unifying multilevel, structural 
equation, latent class and longitudinal models. The framework accommodates 
factors and random coefficients at different levels, regression structures among 
them and a wide range of response processes and flexible specifications of latent
variable distributions. The framework is essentially the generalized linear
latent and mixed model (GLLAMM) framework discussed in Rabe-Hesketh et 
al. (2004a) for the case of continuous latent variables. 

Development of the GLLAMM framework has been in parallel with the
development of the Stata program gllamm available from www.gllamm.org. The
program can estimate all the models discussed in this chapter except (currently)
models including both discrete and continuous latent variables and
discrete latent variable models with more general structural models than that
given in equation (4.15). Furthermore, the multivariate normal distribution
is the only continuous latent variable distribution currently available. A
relatively nontechnical treatment of the model framework with details of using
gllamm to estimate the models is given by Rabe-Hesketh et al. (2004c). Except
where stated otherwise, all models described in the Application Part have
been estimated using gllamm.

Several other more or less general model frameworks with latent variables
have been suggested, some of which are mentioned here. Muthén's general
model framework (e.g. Muthén, 2001, 2002) includes latent traits as well as latent
classes and handles continuous, dichotomous and ordinal responses. Other
seminal contributions to multilevel structural equation modeling with continuous
responses include Goldstein and McDonald (1988) and McDonald and Goldstein
(1989). Brief discussions can be found in Raudenbush and Bryk (2002),
Hox (2002) and de Boeck and Wilson (2004). Fox (2001) considers multilevel
item response models from a Bayesian perspective. Skrondal (1996) considers
latent trait and multilevel models with continuous, censored, dichotomous and
ordinal responses. In the single level setting, Sammel et al. (1997) and Moustaki
and Knott (2000) discuss latent trait models with continuous, dichotomous
and ordinal responses. Bartholomew and Knott (1999) and Moustaki
(1996) discuss both latent trait and latent class models for continuous,
dichotomous and polytomous response processes. Arminger and Küsters (1988,
1989) discuss models with continuous, dichotomous, ordinal and polytomous
responses as well as counts. Hagenaars (1993) and Vermunt (1997) cover latent
class models with structural equations.

We have derived the reduced form distribution using two approaches: latent
variable integration and latent response integration. An advantage of latent
variable integration is that it handles all response types. A disadvantage is that 
conditional independence of the responses given the latent variables must be 
specified. However, this disadvantage may be partly overcome by inducing 
dependence using additional latent variables. An advantage of latent response 
integration is that it is easy to relax the conditional independence assumption, 
but a disadvantage is that it is confined to response models with a latent 
response formulation, not for instance models with a logit link or Poisson 
distribution. 
CHAPTER 5


Identification and equivalence 


5.1 Introduction 

The statistical models considered in this book are fairly complex, going beyond
mere description or exploratory analysis of data. A basic idea is to construct
structural statistical models purporting to represent the main features
of the 'data generating mechanism'; the empirical process having generated
the observed data. In this setting the issues of identification and equivalence
of statistical models become fundamental, since the structural parameters of
scientific interest differ from the reduced form parameters. Broadly speaking,
identification and equivalence concern the prospects of making inferences
regarding the structural model based on the reduced form distribution for the
observed variables.

A parametric statistical model is said to be identified if there is a unique
value of the parameter vector $\boldsymbol{\vartheta}$ or parameter point that can generate a given
reduced form distribution $g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta})$. If a model is not identified there are
several sets of parameters that could have produced the reduced form
distribution. The model parameters are in this sense arbitrary, a situation that
is detrimental for scientific inference. A crucial implication is that consistent
parameter estimation is precluded (e.g. Gabrielsen, 1978).

The related concept of equivalence concerns the prospects of distinguishing
empirically between different statistical models. If two models are merely
reparameterizations, producing identical reduced form distributions, they are
equivalent and equally well compatible with any data. Equivalence is of course
detrimental when the models represent different and perhaps contradictory
substantive data generating mechanisms. On the other hand, we can sometimes
take advantage of equivalence to simplify estimation problems as is
shown in Section 5.3.2.

In Section 5.2, we present some useful definitions of identification. Analytic 
investigation of identification proceeds by studying the properties of the mappings
between reduced form parameters, which are presumed to be identified,
and the fundamental parameters. This approach is illustrated for a number of 
models. Finally, empirical identification is considered. 

In Section 5.3 we first present definitions of equivalence and then discuss the 
analytic approach to equivalence. This involves investigating whether there is 
a one-to-one transformation between parameterizations generating the same 
reduced form distribution. After illustrating this analytic approach using some 
examples, we also consider empirical equivalence. 








Our treatment is fairly informal, focusing on how analytic investigation of 
identification and equivalence can proceed in practice, and we will refer to 
the literature for deeper insight. We will in particular consider methods that 
are straightforward to implement in software for computer algebra such as 
Mathematica (Wolfram, 2003) or Maple (Maple 9 Learning Guide, 2003). 

Although the reader should appreciate the importance of identification and 
equivalence, he or she may wish to skip some of the more technical parts of 
this chapter. 

5.2 Identification 

5.2.1 Definitions 

The following definitions are useful: 

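• Two parameter points $\boldsymbol{\vartheta}_1$, $\boldsymbol{\vartheta}_2$ are called observationally equivalent if they
imply the same reduced form distribution for the observed random variables;
$g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta}_1) = g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta}_2)$.

• A parameter vector $\boldsymbol{\vartheta}$ is globally identified if for any parameter point
$\boldsymbol{\vartheta}_1 \in A$ there is no other observationally equivalent point $\boldsymbol{\vartheta}_2 \in A$.

• A parameter point $\boldsymbol{\vartheta}^0 \in A$ is locally identified if there exists an open
neighborhood of $\boldsymbol{\vartheta}^0$ containing no other $\boldsymbol{\vartheta}$ which is observationally equivalent to
$\boldsymbol{\vartheta}^0$.
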
It should be noted that local identification everywhere in the parameter
space $A$ is a necessary but not sufficient condition for global identification,
see Bechger et al. (2001, p.362) for an example. Also note that the models
considered in this book typically imply nonlinear moment structures. It follows 
that local identification at one point in A does not imply local identification 
everywhere in A and that parameter points can often be found which are 
not locally identified (see Section 5.2.4 for an example). Hence, many of the 
models considered in this book are likely not to be globally identifiable and we 
must resort to the weaker notion of local identification (see also McDonald, 
1982). 

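5.2.2 Methods for analytical investigation of local identification

Recall the unidimensional factor model

$$y_{ij} = \beta_i + \lambda_i \eta_j + \epsilon_{ij}, \quad \eta_j \sim \mathrm{N}(\gamma, \psi), \quad \epsilon_{ij} \sim \mathrm{N}(0, \theta_{ii}),$$

from Section 3.3.2. By considering a linear transformation of the factor, $f_j =
a\eta_j + c$, we can write the model as

$$y_{ij} = (\beta_i - \lambda_i c/a) + (\lambda_i/a) f_j + \epsilon_{ij}, \quad f_j \sim \mathrm{N}(a\gamma + c, a^2\psi), \quad \epsilon_{ij} \sim \mathrm{N}(0, \theta_{ii})$$

$$= \beta_i^* + \lambda_i^* f_j + \epsilon_{ij}, \quad f_j \sim \mathrm{N}(\gamma^*, \psi^*), \quad \epsilon_{ij} \sim \mathrm{N}(0, \theta_{ii}),$$

where

$$\beta_i^* = \beta_i - \lambda_i c/a, \quad \lambda_i^* = \lambda_i/a, \quad \gamma^* = a\gamma + c, \quad \psi^* = a^2\psi.$$
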
Therefore, different parameter points generate the same reduced form
distribution and the model is not identified. In order to identify the model we could
fix the mean and variance of the factor for instance to zero and one,
respectively. Although it was easy to demonstrate that the above model was not
identified, it is not straightforward to show that the suggested parameter
restrictions render the model identified. In fact, it can be shown that the model
is not identified with fewer than three items.
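
The lack of identification of the unrestricted factor model can be made concrete by showing that an original and a transformed parameter point imply the same mean vector and covariance matrix. The sketch below assumes three items and arbitrary illustrative values for the parameters and for the transformation constants `a` and `c`; the function name `moments` is hypothetical.

```python
import numpy as np

def moments(beta, lam, gamma, psi, theta):
    """Implied mean vector and covariance matrix of the items."""
    mean = beta + lam * gamma
    cov = psi * np.outer(lam, lam) + np.diag(theta)
    return mean, cov

beta = np.array([0.2, -0.1, 0.5])     # illustrative intercepts
lam = np.array([1.0, 0.7, 1.4])       # illustrative loadings
theta = np.array([0.6, 0.8, 0.3])     # illustrative unique variances
gamma, psi = 0.4, 1.5                 # factor mean and variance

a, c = 2.0, -0.7                      # arbitrary linear transformation f = a*eta + c

beta_star = beta - lam * c / a
lam_star = lam / a
gamma_star = a * gamma + c
psi_star = a ** 2 * psi

mean_0, cov_0 = moments(beta, lam, gamma, psi, theta)
mean_1, cov_1 = moments(beta_star, lam_star, gamma_star, psi_star, theta)
```

The two parameter points differ but `mean_0` equals `mean_1` and `cov_0` equals `cov_1`, so no amount of data on the items can tell them apart under normality.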

Under suitable assumptions, a necessary and sufficient condition for local 
identification at a parameter point is that the (theoretical) information matrix 
is nonsingular at the point (e.g. Rothenberg, 1971). In principle this condition
can be used for investigating identification, but this approach is usually
not feasible in practice because the information matrix is usually analytically 
intractable in complex models. 

In the special case where reduced form parameters exist (that completely
characterize the reduced form distribution), as for the normal case, the standard
approach to identification instead focuses on the mappings between fundamental
and reduced form parameters (see Section 4.10). This approach,
which yields necessary and sufficient conditions for local identification, appears
to be due to Wald (1950); see Fisher (1966) for a survey. Dupačová and
Wold (1982) applied this idea to conventional structural equation models with
latent variables.

In this chapter we extend the mapping approach beyond normal responses
to models with dichotomous and/or ordinal responses generated from normal
latent responses crossing thresholds, i.e. models with probit links. This
is possible since reduced form parameters exist in this case that completely
characterize the reduced form distribution. For models where there are no
reduced form parameters that completely characterize the distribution, it may
nevertheless be useful to consider the mapping between fundamental parameters
and the reduced form parameters of the first and second order moments,
the idea being that identification relying on higher order moments is likely to
be fragile. For example, for dichotomous or ordinal responses, identification
is likely to be fragile for logit models, if the analogous probit model is not
identified (e.g. Rabe-Hesketh and Skrondal, 2001).

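The fundamental parameters $\boldsymbol{\vartheta} \in A$ produce reduced form parameters $\mathbf{m} \in A'$
via mappings

$$m_s = h_s(\boldsymbol{\vartheta}), \quad 1 \le s \le S,$$

where $h_s(\cdot)$, $1 \le s \le S$, are continuously differentiable known functions.

The probability distribution of the observed variables depends on the
fundamental parameters $\boldsymbol{\vartheta} \in A$ only through the $S$-dimensional reduced form
parameter vector $\mathbf{m}$,

$$g(\mathbf{y} \mid \mathbf{X}; \boldsymbol{\vartheta}) = g^*(\mathbf{y} \mid \mathbf{X}; h_1(\boldsymbol{\vartheta}), \ldots, h_S(\boldsymbol{\vartheta})) = g^*(\mathbf{y} \mid \mathbf{X}; \mathbf{m}) \quad \text{for all } \boldsymbol{\vartheta} \in A,$$

where $g^*$ is the distribution in terms of the reduced form parameters.
Identification of $\boldsymbol{\vartheta}$ can therefore be investigated by considering characteristics of
mappings from $\boldsymbol{\vartheta}$ to $\mathbf{m}$.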









Consider a particular fundamental parameter vector $\boldsymbol{\vartheta}^0$ generating the
reduced form parameter vector $\mathbf{m}^0$,

$$m_s^0 = h_s(\boldsymbol{\vartheta}^0), \quad 1 \le s \le S.$$

Then $\boldsymbol{\vartheta}^0$ is identifiable if and only if $\boldsymbol{\vartheta}^0$ is the unique solution of the equations

$$m_s^0 = h_s(\boldsymbol{\vartheta}), \quad 1 \le s \le S. \tag{5.1}$$

Hence, the identification of $\boldsymbol{\vartheta}^0$ depends solely on the properties of the mappings
$h_s(\cdot)$. The identification problem therefore reduces to the question of
uniqueness of solutions to systems of equations so that we can use classical
results of calculus.

It is evident that a necessary but not sufficient condition for identification
is that there are at least as many elements in the reduced form parameter
vector as there are unknown parameters; $v \le S$. In order to derive stronger
identification results, it is useful to define the Jacobian of the mapping

$$\mathbf{J}(\boldsymbol{\vartheta}) = \frac{\partial \mathbf{h}(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'}.$$

A parameter vector $\boldsymbol{\vartheta}^0$ is a regular point if there is an open neighborhood of
$\boldsymbol{\vartheta}^0$ in which the Jacobian has constant rank. If we know nothing about $\boldsymbol{\vartheta}^0$
except that $\boldsymbol{\vartheta}^0 \in A$ it makes sense to assume that it is a regular point, since
almost all points in $A$ are regular points. However, we will encounter a case
where $\boldsymbol{\vartheta}^0$ is not a regular point in Section 5.2.4.

If $\boldsymbol{\vartheta}^0$ is a regular point, the system of equations (5.1) has the unique solution
$\boldsymbol{\vartheta}^0$ if and only if the rank of the Jacobian is equal to the number of fundamental
parameters $v$. The analysis of identification in this chapter will therefore rely
on the following Lemma:

Lemma 1: If $\boldsymbol{\vartheta}^0$ is a regular point of $\mathbf{J}(\boldsymbol{\vartheta})$, then $\boldsymbol{\vartheta}^0$ is locally identified if
and only if $\mathrm{Rank}[\mathbf{J}(\boldsymbol{\vartheta}^0)] = v$.


5.2.3 Applying the Jacobian method for local identification 
One-factor model with four continuous items 


We return to the one-factor model with four continuous items introduced in Section 4.10, whose parameter vector $\vartheta$ and reduced form parameters $m(\vartheta)$
with Rank$[J(\vartheta)] = 8$, so the model is locally identified if $\vartheta$ is a regular point.


Structural equation model with six continuous items and correlated errors 


We now consider a structural equation model for panel data with three latent variables discussed by Wheaton et al. (1977). The model is recursive in that one latent variable serves as explanatory variable whereas the other two represent the response variables at two panel waves or occasions. Each latent variable is measured by two items, where one item serves as anchor for each factor. An important feature is that errors for repeated measures of the same item are correlated, making investigation of identification by means of pen and paper quite complex (see also Jöreskog and Sörbom, 1989, pp. 173-174). A path diagram of the model is shown in Figure 5.1, where the error terms are represented by small circles.

The measurement part of the model (corresponding to equation (3.31)) is: 


with Rank$[J(\vartheta)] = 8$, so the model is locally identified after anchoring if $\vartheta$ is a regular point.

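We then investigate whether the factor model is identified after ‘factor standardization’, where we instead fix the factor variance to an arbitrary nonzero constant, typically $\psi = 1$. To distinguish the parameters from those under anchoring we put a bar above the symbol. Setting $\psi = 1$ in (5.2) and omitting the fifth column, the Jacobi matrix becomes

$$
J(\bar{\vartheta}) =
\begin{bmatrix}
2\bar{\lambda}_1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
\bar{\lambda}_2 & \bar{\lambda}_1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\bar{\lambda}_3 & 0 & \bar{\lambda}_1 & 0 & 0 & 0 & 0 & 0 \\
\bar{\lambda}_4 & 0 & 0 & \bar{\lambda}_1 & 0 & 0 & 0 & 0 \\
0 & 2\bar{\lambda}_2 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & \bar{\lambda}_3 & \bar{\lambda}_2 & 0 & 0 & 0 & 0 & 0 \\
0 & \bar{\lambda}_4 & 0 & \bar{\lambda}_2 & 0 & 0 & 0 & 0 \\
0 & 0 & 2\bar{\lambda}_3 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & \bar{\lambda}_4 & \bar{\lambda}_3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2\bar{\lambda}_4 & 0 & 0 & 0 & 1
\end{bmatrix}
$$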
… $\zeta_2$ and $\zeta_3$ be correlated, e.g. $\psi_{32} \neq 0$. However, we then obtain Rank$[J(\vartheta)] = 17$ for 18 parameters, demonstrating that this model is not locally identified.


One-factor model with four dichotomous items 


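Consider the following one-factor model for four continuous underlying or latent response variables:

$$
\begin{bmatrix} y^*_{1j} \\ y^*_{2j} \\ y^*_{3j} \\ y^*_{4j} \end{bmatrix}
=
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}
+
\begin{bmatrix} 1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \end{bmatrix} \eta_j
+
\begin{bmatrix} \epsilon_{1j} \\ \epsilon_{2j} \\ \epsilon_{3j} \\ \epsilon_{4j} \end{bmatrix},
$$

where $\eta_j \sim N(0, \psi)$, $\epsilon_{ij} \sim N(0, \theta_{ii})$ and Cov$(\epsilon_{ij}, \epsilon_{i'j}) = 0$. Note that we have ‘anchored’ the factor by imposing $\lambda_1 = 1$.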

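There are 12 unknown parameters

$$\vartheta = [\beta_1, \beta_2, \beta_3, \beta_4, \lambda_2, \lambda_3, \lambda_4, \psi, \theta_{11}, \theta_{22}, \theta_{33}, \theta_{44}]',$$

and the covariance matrix of the latent response variables becomes

$$
\Omega(\vartheta) =
\begin{bmatrix}
\psi + \theta_{11} & & & \\
\lambda_2\psi & \lambda_2^2\psi + \theta_{22} & & \\
\lambda_3\psi & \lambda_3\psi\lambda_2 & \lambda_3^2\psi + \theta_{33} & \\
\lambda_4\psi & \lambda_4\psi\lambda_2 & \lambda_4\psi\lambda_3 & \lambda_4^2\psi + \theta_{44}
\end{bmatrix}.
$$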

The present model is identical to the one-factor model for four continuous items including intercepts, with the crucial difference that the underlying or latent response variables are not observed. Instead, the latent response variables are related to observed dichotomous responses via threshold functions

$$
y_{ij} =
\begin{cases}
1 & \text{if } y^*_{ij} > 0 \\
0 & \text{otherwise.}
\end{cases}
$$


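The marginal probability $\Pr(y_{ij} = 1)$ becomes

$$
\Pr(y_{ij} = 1) = \Pr(\xi_{ij} > -\beta_i)
= \int_{-\beta_i/\sqrt{\lambda_i^2\psi + \theta_{ii}}}^{\infty} \phi(z)\, dz
= \Phi\!\left(\frac{\beta_i}{\sqrt{\lambda_i^2\psi + \theta_{ii}}}\right),
$$

where $\phi$ is the standard normal density, $\xi_{ij} = \lambda_i\eta_j + \epsilon_{ij}$, and $\Phi$ is the cumulative standard normal distribution. The means $\mu_i$ of the (standardized) latent responses $y^*_{ij}$ underlying the items $i$ are

$$\mu_i = \frac{\beta_i}{\sqrt{\lambda_i^2\psi + \theta_{ii}}},$$

and are identified from the marginal probabilities. The joint response probabilities $P_{stuv} \equiv \Pr(y_1 = s, y_2 = t, y_3 = u, y_4 = v)$ can be expressed as four-dimensional integrals

$$
P_{stuv} = \int_{I_1(s)} \int_{I_2(t)} \int_{I_3(u)} \int_{I_4(v)}
\phi_4(z_1, z_2, z_3, z_4; R)\, dz_1\, dz_2\, dz_3\, dz_4,
$$

where $I_i(1) = (-\mu_i, \infty)$ and $I_i(0) = (-\infty, -\mu_i]$, and $\phi_4(\cdot, \cdot, \cdot, \cdot; R)$ is the four-dimensional standard normal density with tetrachoric correlation matrix

$$R(\vartheta) = \mathrm{diag}(\Omega(\vartheta))^{-1/2}\, \Omega(\vartheta)\, \mathrm{diag}(\Omega(\vartheta))^{-1/2}.$$

The tetrachoric correlation matrix is well known to be identified.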

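All in all there are 10 reduced form parameters, 4 means and 6 nonredundant tetrachoric correlations, which are assembled in

$$
\begin{aligned}
m(\vartheta) = \Bigg[ &
\frac{\beta_1}{\sqrt{\psi + \theta_{11}}},\;
\frac{\beta_2}{\sqrt{\lambda_2^2\psi + \theta_{22}}},\;
\frac{\beta_3}{\sqrt{\lambda_3^2\psi + \theta_{33}}},\;
\frac{\beta_4}{\sqrt{\lambda_4^2\psi + \theta_{44}}}, \\
& \frac{\lambda_2\psi}{\sqrt{\lambda_2^2\psi + \theta_{22}}\sqrt{\psi + \theta_{11}}},\;
\frac{\lambda_3\psi}{\sqrt{\lambda_3^2\psi + \theta_{33}}\sqrt{\psi + \theta_{11}}},\;
\frac{\lambda_4\psi}{\sqrt{\lambda_4^2\psi + \theta_{44}}\sqrt{\psi + \theta_{11}}}, \\
& \frac{\lambda_3\psi\lambda_2}{\sqrt{\lambda_3^2\psi + \theta_{33}}\sqrt{\lambda_2^2\psi + \theta_{22}}},\;
\frac{\lambda_4\psi\lambda_2}{\sqrt{\lambda_4^2\psi + \theta_{44}}\sqrt{\lambda_2^2\psi + \theta_{22}}},\;
\frac{\lambda_4\psi\lambda_3}{\sqrt{\lambda_4^2\psi + \theta_{44}}\sqrt{\lambda_3^2\psi + \theta_{33}}}
\Bigg]'.
\end{aligned}
$$

In this case we need not proceed to obtain the rank of the Jacobian since there are more unknown parameters, 12, than identified reduced form parameters, 10. Thus it is obvious that the model is not identified.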

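Consider now fixing the variances of the errors $\epsilon_{ij}$ to 1, i.e. $\theta_{11} = \theta_{22} = \theta_{33} = \theta_{44} = 1$. There are now 8 unknown parameters

$$\vartheta = [\beta_1, \beta_2, \beta_3, \beta_4, \lambda_2, \lambda_3, \lambda_4, \psi]',$$

and the 10 identified reduced form parameters become

$$
\begin{aligned}
m(\vartheta) = \Bigg[ &
\frac{\beta_1}{\sqrt{\psi + 1}},\;
\frac{\beta_2}{\sqrt{\lambda_2^2\psi + 1}},\;
\frac{\beta_3}{\sqrt{\lambda_3^2\psi + 1}},\;
\frac{\beta_4}{\sqrt{\lambda_4^2\psi + 1}}, \\
& \frac{\lambda_2\psi}{\sqrt{\lambda_2^2\psi + 1}\sqrt{\psi + 1}},\;
\frac{\lambda_3\psi}{\sqrt{\lambda_3^2\psi + 1}\sqrt{\psi + 1}},\;
\frac{\lambda_4\psi}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\psi + 1}}, \\
& \frac{\lambda_3\psi\lambda_2}{\sqrt{\lambda_3^2\psi + 1}\sqrt{\lambda_2^2\psi + 1}},\;
\frac{\lambda_4\psi\lambda_2}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\lambda_2^2\psi + 1}},\;
\frac{\lambda_4\psi\lambda_3}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\lambda_3^2\psi + 1}}
\Bigg]'. \qquad (5.3)
\end{aligned}
$$

The $10 \times 8$ Jacobian $J(\vartheta)$ is too large to be presented in this case, but the main point is that Rank$[J(\vartheta)]$ is 8, so the model is locally identified if $\vartheta$ is a regular point.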
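Because the Jacobian is unwieldy by hand, a numerical check at a single parameter point can be a useful complement. The sketch below is an illustration under stated assumptions rather than the method used in the book: it takes the reduced form to consist of the four standardized means $\beta_i/\sqrt{\lambda_i^2\psi + 1}$ and the six tetrachoric correlations $\lambda_i\lambda_k\psi/(\sqrt{\lambda_i^2\psi + 1}\sqrt{\lambda_k^2\psi + 1})$ with $\lambda_1 = 1$, approximates the Jacobian by central differences, and counts pivots above a tolerance. The parameter point, step size and tolerance are arbitrary illustrative choices.

```python
import math


def reduced_form(p):
    """Standardized means and tetrachoric correlations for error variances fixed at 1.

    p = [beta_1, beta_2, beta_3, beta_4, lambda_2, lambda_3, lambda_4, psi]
    """
    beta, lam, psi = p[0:4], [1.0] + list(p[4:7]), p[7]
    sd = [math.sqrt(l * l * psi + 1.0) for l in lam]
    means = [b / s for b, s in zip(beta, sd)]
    corrs = [lam[i] * lam[k] * psi / (sd[i] * sd[k])
             for i in range(4) for k in range(i)]
    return means + corrs  # 4 + 6 = 10 reduced form parameters


def numeric_jacobian(h, point, step=1e-6):
    """Central-difference approximation to the Jacobian (rows: h_s, columns: parameters)."""
    columns = []
    for r in range(len(point)):
        up, down = list(point), list(point)
        up[r] += step
        down[r] -= step
        columns.append([(a - b) / (2 * step) for a, b in zip(h(up), h(down))])
    return [list(row) for row in zip(*columns)]


def numeric_rank(matrix, tol=1e-6):
    """Rank by Gaussian elimination with partial pivoting and a pivot tolerance."""
    rows = [list(row) for row in matrix]
    rk = 0
    for col in range(len(rows[0])):
        if rk == len(rows):
            break
        pivot = max(range(rk, len(rows)), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) < tol:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


# Illustrative parameter point (an assumption, not an estimate from the text).
point = [0.5, -0.3, 0.2, 0.1, 0.8, 1.2, 1.5, 1.0]
rank_probit = numeric_rank(numeric_jacobian(reduced_form, point))
print(rank_probit, "of", len(point), "parameters")
```

A numerical rank depends on the tolerance and says nothing about other points in the parameter space, so it supports but does not replace the analytical argument.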




$R(\vartheta)$:

where the off-diagonal elements are identified. From the marginal probabilities $\Pr(y_{ij} = 1)$ we can identify

$\epsilon_{ij} \sim N(0, 1)$ and Cov$(\epsilon_{ij}, \epsilon_{i'j}) = 0$. The latent response variables are related to observed dichotomous responses via threshold functions

$$
y_{ij} =
\begin{cases}
1 & \text{if } y^*_{ij} > 0 \\
0 & \text{otherwise.}
\end{cases}
$$


The model is not identified since the variances of the latent response errors are not identified, as shown above for the one-factor model with dichotomous items.

Rabe-Hesketh and Skrondal (2001) suggested that $\sigma^2$ be fixed at a positive value to ensure identification. Actually, $\sigma^2$ cannot be fixed to any positive value (Rabe-Hesketh and Skrondal, 2001, p. 1258) but we will for simplicity impose $\sigma^2 = 1$ here. There are then 6 unknown parameters

$$\vartheta = [\beta_1, \beta_2, \beta_3, \beta_4, \rho_1, \rho_2]',$$

implying the tetrachoric correlation matrix

Coull and Agresti model for four dichotomous responses 

Coull and Agresti (2000) suggested a multivariate binomial logit-normal (BLN) model. In their first example they specified a simple model with a separate intercept for each of four occasions and no other covariates. We will here consider the probit-normal version of their model, denoted the BPN model in Rabe-Hesketh and Skrondal (2001):
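The identified reduced form parameters are placed in

$$
m(\vartheta) = \left[
\frac{\beta_1}{\sqrt{2}},\; \frac{\beta_2}{\sqrt{2}},\; \frac{\beta_3}{\sqrt{2}},\; \frac{\beta_4}{\sqrt{2}},\;
\frac{\rho_1}{2},\; \frac{\rho_1}{2},\; \frac{\rho_2}{2},\; \frac{\rho_1}{2},\; \frac{\rho_2}{2},\; \frac{\rho_2}{2}
\right]', \qquad (5.4)
$$

and the $10 \times 6$ Jacobian becomes

$$
J(\vartheta) =
\begin{bmatrix}
\frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{\sqrt{2}} & 0 & 0 & 0 & 0 \\
0 & 0 & \frac{1}{\sqrt{2}} & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{\sqrt{2}} & 0 & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & 0 \\
0 & 0 & 0 & 0 & \frac{1}{2} & 0 \\
0 & 0 & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & \frac{1}{2} & 0 \\
0 & 0 & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 0 & 0 & 0 & \frac{1}{2}
\end{bmatrix},
$$

where Rank$[J(\vartheta)] = 6$, so the model is locally identified. Also note that $\vartheta$ is not involved in the Jacobian, so the model is locally identified throughout parameter space.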


One-factor model with three ordinal items 

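Consider the following one-factor model for three continuous underlying or latent response variables:

$$
\begin{bmatrix} y^*_{1j} \\ y^*_{2j} \\ y^*_{3j} \end{bmatrix}
=
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
+
\begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix} \eta_j
+
\begin{bmatrix} \epsilon_{1j} \\ \epsilon_{2j} \\ \epsilon_{3j} \end{bmatrix},
$$

where $\eta_j \sim N(0, 1)$, $\epsilon_{ij} \sim N(0, \theta_{ii})$ and Cov$(\epsilon_{ij}, \epsilon_{i'j}) = 0$. The latent response variables are related to observed trichotomous (three-category) responses via threshold functions with constant thresholds across items

$$
y_{ij} =
\begin{cases}
0 & \text{if } y^*_{ij} \le 0 \\
1 & \text{if } 0 < y^*_{ij} \le \kappa_2 \\
2 & \text{if } \kappa_2 < y^*_{ij}.
\end{cases}
$$

There are 10 unknown parameters

$$\vartheta = [\kappa_2, \beta_1, \beta_2, \beta_3, \lambda_1, \lambda_2, \lambda_3, \theta_{11}, \theta_{22}, \theta_{33}]'.$$

The marginal probabilities are

$$\Pr(y_{ij} = 0) = \Phi\!\left(\frac{-\beta_i}{\sqrt{\lambda_i^2 + \theta_{ii}}}\right), \qquad (5.5)$$

$$\Pr(y_{ij} = 1) = \Phi\!\left(\frac{\kappa_2 - \beta_i}{\sqrt{\lambda_i^2 + \theta_{ii}}}\right) - \Phi\!\left(\frac{-\beta_i}{\sqrt{\lambda_i^2 + \theta_{ii}}}\right)$$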
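were unnecessarily restrictive. He showed that it suffices to fix one of the error variances to obtain local identification. Without loss of generality, we fix the error variance of the first item ($\theta_{11} = 1$). There are now 9 unknown parameters

$$\vartheta = [\kappa_2, \beta_1, \beta_2, \beta_3, \lambda_1, \lambda_2, \lambda_3, \theta_{22}, \theta_{33}]',$$

and the 9 identified reduced form parameters are placed in

$$
\begin{aligned}
m(\vartheta) = \Bigg[ &
\frac{\beta_1}{\sqrt{\lambda_1^2 + 1}},\;
\frac{\beta_2}{\sqrt{\lambda_2^2 + \theta_{22}}},\;
\frac{\beta_3}{\sqrt{\lambda_3^2 + \theta_{33}}},\;
\frac{\kappa_2}{\sqrt{\lambda_1^2 + 1}},\;
\frac{\kappa_2}{\sqrt{\lambda_2^2 + \theta_{22}}},\;
\frac{\kappa_2}{\sqrt{\lambda_3^2 + \theta_{33}}}, \\
& \frac{\lambda_2\lambda_1}{\sqrt{\lambda_2^2 + \theta_{22}}\sqrt{\lambda_1^2 + 1}},\;
\frac{\lambda_3\lambda_1}{\sqrt{\lambda_3^2 + \theta_{33}}\sqrt{\lambda_1^2 + 1}},\;
\frac{\lambda_3\lambda_2}{\sqrt{\lambda_3^2 + \theta_{33}}\sqrt{\lambda_2^2 + \theta_{22}}}
\Bigg]'.
\end{aligned}
$$

The $9 \times 9$ Jacobian, which is too large to be presented, has Rank$[J(\vartheta)] = 9$, so the model is locally identified, as shown by Skrondal (1996), if $\vartheta$ is a regular point. We will use this parametrization in investigating the life-satisfaction of Americans in Section 10.4.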

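Consider now the model where the error variances are constrained to one as in Muraki (1990), but the threshold is permitted to vary across items. The threshold functions become

$$
y_{ij} =
\begin{cases}
0 & \text{if } y^*_{ij} \le 0 \\
1 & \text{if } 0 < y^*_{ij} \le \kappa_{i2} \\
2 & \text{if } \kappa_{i2} < y^*_{ij},
\end{cases}
$$

where we note that the thresholds have index $i$. There are now 9 unknown parameters

$$\vartheta = [\kappa_{12}, \kappa_{22}, \kappa_{32}, \beta_1, \beta_2, \beta_3, \lambda_1, \lambda_2, \lambda_3]',$$

and the 9 reduced form parameters are placed in

$$
\begin{aligned}
m(\vartheta) = \Bigg[ &
\frac{\beta_1}{\sqrt{\lambda_1^2 + 1}},\;
\frac{\beta_2}{\sqrt{\lambda_2^2 + 1}},\;
\frac{\beta_3}{\sqrt{\lambda_3^2 + 1}},\;
\frac{\kappa_{12}}{\sqrt{\lambda_1^2 + 1}},\;
\frac{\kappa_{22}}{\sqrt{\lambda_2^2 + 1}},\;
\frac{\kappa_{32}}{\sqrt{\lambda_3^2 + 1}}, \\
& \frac{\lambda_2\lambda_1}{\sqrt{\lambda_2^2 + 1}\sqrt{\lambda_1^2 + 1}},\;
\frac{\lambda_3\lambda_1}{\sqrt{\lambda_3^2 + 1}\sqrt{\lambda_1^2 + 1}},\;
\frac{\lambda_3\lambda_2}{\sqrt{\lambda_3^2 + 1}\sqrt{\lambda_2^2 + 1}}
\Bigg]'.
\end{aligned}
$$

The $9 \times 9$ Jacobian has Rank$[J(\vartheta)] = 9$, so the model is locally identified if $\vartheta$ is a regular point.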








The $10 \times 9$ Jacobian becomes
with Rank$[J(\vartheta)] = 9$, so the model is locally identified if $\vartheta$ is a regular point.

Consider now the special case where the factors are uncorrelated, $\psi_{21} = 0$. Substituting this restriction in the above Jacobian, the rank becomes 7. It is thus evident that $\vartheta^0$ with $\psi_{21} = 0$ is not a regular point since the rank of the Jacobian in this case is not constant in the neighborhood of $\vartheta^0$. Furthermore, the model is not locally identified for $\psi_{21} = 0$ since the rank of the Jacobian is 7 for 8 unknown parameters ($\lambda_{21}$, $\lambda_{42}$, $\psi_{11}$, $\psi_{22}$, $\theta_{11}$, $\theta_{22}$, $\theta_{33}$, $\theta_{44}$). We are thus in the somewhat unusual situation where a model becomes locally identified by relaxing (not imposing) a parameter restriction. The above model nicely illustrates that Lemma 1 can only be applied if the parameter point is a regular point.
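The drop in rank at $\psi_{21} = 0$ can be reproduced with exact arithmetic. This is a minimal sketch under assumptions: items 1 and 2 are taken to load on the first factor (loadings 1 and $\lambda_{21}$), items 3 and 4 on the second (loadings 1 and $\lambda_{42}$), the reduced form is taken to be the ten nonredundant covariances, and the parameter values are illustrative only.

```python
from fractions import Fraction


def reduced_form(p):
    """Nonredundant covariances of the two-factor model with four items."""
    l21, l42, psi11, psi21, psi22, th11, th22, th33, th44 = p
    return [psi11 + th11,
            l21 * psi11,
            psi21,
            l42 * psi21,
            l21 * l21 * psi11 + th22,
            l21 * psi21,
            l42 * psi21 * l21,
            psi22 + th33,
            l42 * psi22,
            l42 * l42 * psi22 + th44]


def jacobian(h, point, step=Fraction(1, 2)):
    """Central differences; exact for these polynomials when using Fractions."""
    columns = []
    for r in range(len(point)):
        up, down = list(point), list(point)
        up[r] += step
        down[r] -= step
        columns.append([(a - b) / (2 * step) for a, b in zip(h(up), h(down))])
    return [list(row) for row in zip(*columns)]


def rank(matrix):
    """Rank by Gaussian elimination over exact rationals."""
    rows = [list(row) for row in matrix]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def point(psi21):
    """Illustrative values: l21, l42, psi11, psi21, psi22, th11..th44."""
    values = (Fraction(3, 2), Fraction(1, 2), 1, psi21, 2, 1, 1, 1, 1)
    return [Fraction(v) for v in values]


rank_correlated = rank(jacobian(reduced_form, point(Fraction(1, 2))))
rank_uncorrelated = rank(jacobian(reduced_form, point(Fraction(0))))
print(rank_correlated, rank_uncorrelated)
```

With correlated factors the $10 \times 9$ Jacobian has full column rank 9, whereas at $\psi_{21} = 0$ the same Jacobian has rank 7, so the rank is not constant in any neighborhood of such a point.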


Two-factor model with four continuous items and exogenous variable 

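Consider now the extension of the above model where the factors are regressed on a covariate $x_j$,

$$\eta_{1j} = \gamma_{11} x_j + \zeta_{1j}, \qquad \eta_{2j} = \gamma_{21} x_j + \zeta_{2j},$$

where

$$
\begin{bmatrix} \zeta_{1j} \\ \zeta_{2j} \end{bmatrix}
\sim N_2\!\left(
\begin{bmatrix} 0 \\ 0 \end{bmatrix},
\begin{bmatrix} \psi_{11} & \\ \psi_{21} & \psi_{22} \end{bmatrix}
\right).
$$

A path diagram of this model is given in the right panel of Figure 5.2.

The unknown parameters are

$$\vartheta = [\gamma_{11}, \gamma_{21}, \lambda_{21}, \lambda_{42}, \psi_{11}, \psi_{21}, \psi_{22}, \theta_{11}, \theta_{22}, \theta_{33}, \theta_{44}]'.$$

We obtain the regression/mean structure

$$
\begin{aligned}
E(y_{1j}|x_j) &= \gamma_{11} x_j \\
E(y_{2j}|x_j) &= \lambda_{21}\gamma_{11} x_j \\
E(y_{3j}|x_j) &= \gamma_{21} x_j \\
E(y_{4j}|x_j) &= \lambda_{42}\gamma_{21} x_j,
\end{aligned}
$$

and the vector of reduced form parameters becomes

$$
\begin{aligned}
m(\vartheta) = [ & \gamma_{11},\; \lambda_{21}\gamma_{11},\; \gamma_{21},\; \lambda_{42}\gamma_{21},\; \psi_{11} + \theta_{11},\; \lambda_{21}\psi_{11},\; \psi_{21},\; \lambda_{42}\psi_{21},\; \lambda_{21}^2\psi_{11} + \theta_{22}, \\
& \psi_{21}\lambda_{21},\; \lambda_{42}\psi_{21}\lambda_{21},\; \psi_{22} + \theta_{33},\; \lambda_{42}\psi_{22},\; \lambda_{42}^2\psi_{22} + \theta_{44} ]'.
\end{aligned}
$$

The $14 \times 11$ Jacobian becomes
with Rank$[J(\vartheta)] = 11$, so the model is locally identified if $\vartheta$ is a regular point. Consider once again the special case where $\psi_{21} = 0$; the factors are uncorrelated. Substituting this restriction in the above Jacobian the rank remains 11, so $\psi_{21} = 0$ no longer implies that $\vartheta$ is an irregular point, in contrast to the case without a covariate.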
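A short worked argument, offered as a sketch rather than as the authors' derivation, suggests why the covariate removes the irregularity. Assuming $\gamma_{11}$, $\gamma_{21}$, $\lambda_{21}$ and $\lambda_{42}$ are all nonzero, every parameter can be solved for from reduced form parameters (shown in parentheses) without dividing by $\psi_{21}$:

$$
\begin{aligned}
\lambda_{21} &= \frac{(\lambda_{21}\gamma_{11})}{(\gamma_{11})}, &
\lambda_{42} &= \frac{(\lambda_{42}\gamma_{21})}{(\gamma_{21})}, \\
\psi_{11} &= \frac{(\lambda_{21}\psi_{11})}{\lambda_{21}}, &
\psi_{22} &= \frac{(\lambda_{42}\psi_{22})}{\lambda_{42}}, \\
\theta_{11} &= (\psi_{11} + \theta_{11}) - \psi_{11}, &
\theta_{33} &= (\psi_{22} + \theta_{33}) - \psi_{22}, \\
\theta_{22} &= (\lambda_{21}^2\psi_{11} + \theta_{22}) - \lambda_{21}^2\psi_{11}, &
\theta_{44} &= (\lambda_{42}^2\psi_{22} + \theta_{44}) - \lambda_{42}^2\psi_{22}.
\end{aligned}
$$

Together with $\gamma_{11}$, $\gamma_{21}$ and $\psi_{21}$, which are themselves reduced form parameters, this recovers all 11 parameters. Without the covariate, the loadings are only available through ratios such as $(\psi_{21}\lambda_{21})/(\psi_{21})$, which break down at $\psi_{21} = 0$.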


5.2.5 Empirical identification 

Analytical identification proceeds in terms of unknown true parameters $\vartheta$. A useful complement to this ‘theoretical’ approach is ‘empirical’ identification, which is instead based on properties of estimated parameters. Although less stringent than the analytical method, since it is based on estimated parameters instead of theoretical parameters, the empirical method has some advantages over the former approach. First, empirical investigation is based on the estimated information matrix, a natural byproduct of maximum likelihood estimation. Second, empirical identification is more general in the sense that it does not rest on the existence of globally identified reduced form parameters that completely characterize the reduced form distribution. Third, it can be argued that empirical identification assesses identification where it matters, at the parameter estimates. For instance, inferences are expected to be problematic for the two-factor model in Section 5.2.4 if $\psi_{21} \approx 0$. Fourth, empirical identification addresses problems that may be inherent in the sample on which inferences must be based. Collinearity among predictor variables in linear regression is an example of an empirical identification problem.

Inspired by Wiley (1973) and McDonald and Krane (1977), we suggest the following definition:

• A model is empirically identified for a sample if the estimated information matrix at the maximum likelihood solution $\hat{\vartheta}$ is nonsingular.

Note that this condition is simply an empirical counterpart of the condition based on the theoretical information matrix (e.g. Rothenberg, 1971) mentioned earlier.

A measure of how close a matrix is to singularity is the condition number, defined as the square root of the ratio of the largest to the smallest eigenvalue. In practice, we say that a model is empirically underidentified if the condition number is ‘large’, exceeding some threshold. When a model is empirically underidentified, standard errors and intercorrelations of parameter estimates will be high. We would for example expect this scenario when there is collinearity among predictor variables and for the two-factor model in Section 5.2.4 when $\psi_{21} \approx 0$.
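To make the definition concrete, the condition number can be computed in closed form for a symmetric $2 \times 2$ matrix. The sketch below is only an illustration: the function name is hypothetical, the input is a stylized information matrix for two standardized predictors with correlation $r$ (an assumed example, not data from the text), and larger matrices would need a general eigenvalue routine.

```python
import math


def condition_number(a, b, d):
    """Square root of the ratio of the largest to the smallest eigenvalue of the
    symmetric 2x2 matrix [[a, b], [b, d]]."""
    center = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    smallest, largest = center - radius, center + radius
    if smallest <= 0.0:
        return math.inf  # singular (or not positive definite)
    return math.sqrt(largest / smallest)


# Stylized information matrix for two standardized predictors with correlation r:
# the eigenvalues are 1 - r and 1 + r.
for r in (0.0, 0.9, 0.99):
    print(r, round(condition_number(1.0, r, 1.0), 2))
```

As the correlation between the predictors approaches one, the smallest eigenvalue approaches zero and the condition number grows without bound, which is the collinearity scenario described above.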


The binomial logit-normal (BLN) model 

We now consider the BLN model discussed by Coull and Agresti (2000), the logit version of the BPN models discussed above. Rabe-Hesketh and Skrondal (2001) argued that the BLN model is not identified from the first and second order moments and is therefore likely to be empirically underidentified since information in higher order moments is likely to be scarce.

Estimating the model without constraining $\sigma^2$, the condition number is 179.5, which is extremely large (the smallest eigenvalue was less than 0.004) and indicates that the observed information matrix is nearly singular. Thus, the BLN model appears to be empirically underidentified. We also estimated the model constraining $\sigma^2$ equal to its maximum likelihood estimate of 4.06, giving a condition number of 5.2.

Inverting the estimated information matrices, we obtained the estimated covariance matrices of the parameter estimates. As can be seen in Table 5.1, the estimated standard errors decrease substantially when $\sigma$ is fixed. The correlations of the parameter estimates are shown in Table 5.2. For the unconstrained model, the parameter estimates are highly intercorrelated, most correlations approaching $\pm 1$, the smallest correlation (in absolute value) being -0.79, whereas the highest correlation for the constrained model is 0.19.

Having demonstrated empirical underidentification, we now investigate if this is due to the scarce information in the higher order moments. For a range of values of $\sigma$, we computed the other parameters to preserve the means and correlations implied by the maximum likelihood solution. The models with these different sets of parameter values imply identical first and second order moments of the latent responses but different higher order moments. The deviance of these models is plotted against $\sigma$ in Figure 5.3, where $\sigma$ increases from 1.35, the lowest value consistent with the correlations of the latent responses, to 8. The deviance hardly changes at all although the higher order moments were deliberately ignored in determining the other parameters for each value of $\sigma$. This provides direct evidence for the scarcity of information in the higher order moments of the latent responses. Note that the curve in Figure 5.3 represents an upper bound for the deviance corresponding to the profile likelihood (see Section 8.3.5) for $\sigma$ since the other parameters are not

© 2004 by Chapman & Hall/CRC 


Table 5.1 Parameter estimates, standard errors and deviance for constrained and unconstrained versions of the BLN model (20 quadrature points per dimension)

                      Standard Error
       Est      Unconstrained    Constrained

β₁

β₂

β₃
Table 5.2 Correlations of the parameter estimates for the BLN model (below the diagonal: unconstrained model; above the diagonal: constrained model)

        β₁       β₂       β₃       β₄       σ        ρ₁       ρ₂
β₁      1        0.190    0.187   -0.083    -        0.050   -0.025
β₂      0.997    1        0.185   -0.080    -        0.062   -0.033
β₃      0.997    0.998    1       -0.077    -        0.069   -0.037
β₄      0.997    0.997    0.997    1        -        0.014   -0.045
σ      -0.998   -0.998   -0.999   -0.998    1        -        -
ρ₁      0.941    0.942    0.942    0.941   -0.942    1       -0.106
ρ₂     -0.814   -0.815   -0.815   -0.815    0.815   -0.788    1

Source: Rabe-Hesketh and Skrondal (2001)


estimated by maximum likelihood. Estimating the model with $\sigma$ fixed at 8.2, for example, gives a deviance of only 6.53.

5.3 Equivalence 

5.3.1 Definitions 

• Two statistical models $M_1$ and $M_2$ with $\vartheta_A \in A$ and $\vartheta_B \in B$ are globally equivalent if they are reparameterizations in the sense that there exist one-to-one transformations between $\vartheta_A$ and $\vartheta_B$ throughout $A$ and $B$ making them observationally equivalent.

Figure 5.3 Deviance for different values of the standard deviation $\sigma$ (horizontal axis: standard deviation $\sigma$). The other parameters have been computed to preserve the first and second order moments implied by the maximum likelihood solution. (Source: Rabe-Hesketh and Skrondal, 2001)

As was the case for identification in nonlinear moment structures, the prospects for global equivalence seem to be bleak, and we resort to a notion of local equivalence:

• Two statistical models $M_1$ and $M_2$ with locally identified parameter points $\vartheta_A \in A$ and $\vartheta_B \in B$ are locally equivalent if they are reparameterizations in the sense that there exists a one-to-one transformation between $\vartheta_A$ and $\vartheta_B$ in open neighborhoods of the points making them observationally equivalent.

5.3.2 Analytical investigation of equivalence 

We propose the following approach for models that are completely characterized by lower order moments. Consider the reduced form parameters $m_s = h_s(\vartheta_A)$ and $m^*_s = h^*_s(\vartheta_B)$ of two potentially equivalent models with fundamental parameter vectors $\vartheta_A$ and $\vartheta_B$. If the models are equivalent it follows that for each parameter point $\vartheta_A$ there is a point $\vartheta_B$ such that

$$h_s(\vartheta_A) = h^*_s(\vartheta_B).$$

We then investigate whether and under which conditions a one-to-one transformation between the fundamental parameters of the two models can be found by solving for $\vartheta_A$ in terms of $\vartheta_B$ and for $\vartheta_B$ in terms of $\vartheta_A$.

For the special case where we want to investigate local equivalence for two submodels $M_1$ and $M_2$ of a common underidentified model $M_{12}$, we consider the following approach suggested by Luijben (1991). Let $\vartheta_{12}$ denote the vector of $v$ fundamental parameters of $M_{12}$. Define $M_0$ as the model where the restrictions on $M_{12}$ leading to $M_1$ as well as $M_2$ are imposed. Also let $\tilde{\vartheta}_{12}$, assumed to be a regular point of $M_{12}$, be a restricted version of $\vartheta_{12}$ where the restrictions leading to $M_0$ are imposed. Under the restrictions yielding $M_0$, the parameter vectors of $M_1$ and $M_2$ are assumed to be locally identified and regular points. Define the Jacobian $J(\vartheta_{12})|_{\vartheta_{12} = \tilde{\vartheta}_{12}}$ as the Jacobian $J(\vartheta_{12})$ of the mappings from fundamental parameters $\vartheta_{12}$ to the reduced form parameter vector $m_{12}$ with the restrictions in $\tilde{\vartheta}_{12}$ inserted.

The analysis of equivalence can in the present setting rely on the following Lemma:

Lemma 2: $M_1$ and $M_2$ are locally equivalent if and only if Rank$[J(\vartheta_{12})|_{\vartheta_{12} = \tilde{\vartheta}_{12}}] < v$.

As for the Jacobian strategy for identification, this approach for investigating equivalence is confined to models where there are reduced form parameters completely characterizing the reduced form distribution.

Unfortunately, we cannot use Lemma 2 when no common underidentified 
model is known in which the two possibly equivalent submodels are nested. 
Bekker et al. (1994) point out that Luijben’s approach is quite restrictive and 
consider investigation of local equivalence for the general case. Unfortunately, 
the derived conditions appear to be extremely difficult to evaluate in practice. 

One-factor model with four continuous items 

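Let us once more return to the one-factor model with four continuous items introduced in Section 4.10. On pages 139-140 we demonstrated that there were two locally identified models, one with anchoring ($\lambda_1 = 1$) and another with factor standardization ($\psi = 1$).

We proceed by equating the reduced form parameters produced by the two parametrizations:

$$
\begin{aligned}
m(\bar{\vartheta}) &= [\bar{\lambda}_1^2 + \bar{\theta}_{11},\; \bar{\lambda}_2\bar{\lambda}_1,\; \bar{\lambda}_3\bar{\lambda}_1,\; \bar{\lambda}_4\bar{\lambda}_1,\; \bar{\lambda}_2^2 + \bar{\theta}_{22},\; \bar{\lambda}_3\bar{\lambda}_2,\; \bar{\lambda}_4\bar{\lambda}_2,\; \bar{\lambda}_3^2 + \bar{\theta}_{33},\; \bar{\lambda}_4\bar{\lambda}_3,\; \bar{\lambda}_4^2 + \bar{\theta}_{44}] \\
&= [\psi^* + \theta^*_{11},\; \lambda^*_2\psi^*,\; \lambda^*_3\psi^*,\; \lambda^*_4\psi^*,\; (\lambda^*_2)^2\psi^* + \theta^*_{22},\; \lambda^*_3\psi^*\lambda^*_2,\; \lambda^*_4\psi^*\lambda^*_2, \\
&\qquad (\lambda^*_3)^2\psi^* + \theta^*_{33},\; \lambda^*_4\psi^*\lambda^*_3,\; (\lambda^*_4)^2\psi^* + \theta^*_{44}] = m(\vartheta^*).
\end{aligned}
$$

Solving for $\vartheta^*$ gives us unique solutions

$$
\theta^*_{11} = \bar{\theta}_{11},\; \theta^*_{22} = \bar{\theta}_{22},\; \theta^*_{33} = \bar{\theta}_{33},\; \theta^*_{44} = \bar{\theta}_{44},\;
\lambda^*_2 = \frac{\bar{\lambda}_2}{\bar{\lambda}_1},\; \lambda^*_3 = \frac{\bar{\lambda}_3}{\bar{\lambda}_1},\; \lambda^*_4 = \frac{\bar{\lambda}_4}{\bar{\lambda}_1},\; \psi^* = \bar{\lambda}_1^2.
$$

The standardized model can apparently generate parameters throughout the parameter space of the anchored model as long as $\bar{\lambda}_1 \neq 0$.

Solving for $\bar{\vartheta}$, we obtain unique solutions, subject to determination of the sign of the factor loadings,

$$
\bar{\theta}_{11} = \theta^*_{11},\; \bar{\theta}_{22} = \theta^*_{22},\; \bar{\theta}_{33} = \theta^*_{33},\; \bar{\theta}_{44} = \theta^*_{44},\;
\bar{\lambda}_1 = \pm\sqrt{\psi^*},\; \bar{\lambda}_2 = \pm\lambda^*_2\sqrt{\psi^*},\; \bar{\lambda}_3 = \pm\lambda^*_3\sqrt{\psi^*},\; \bar{\lambda}_4 = \pm\lambda^*_4\sqrt{\psi^*}.
$$

It seems that the anchored model can generate parameters throughout the parameter space of the standardized model if $\psi^* > 0$.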
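The two transformations can be written out and checked numerically. The sketch below is illustrative only: it assumes the mappings $\lambda^*_i = \bar{\lambda}_i/\bar{\lambda}_1$, $\psi^* = \bar{\lambda}_1^2$ in one direction and $\bar{\lambda}_i = \pm\lambda^*_i\sqrt{\psi^*}$ in the other, with error variances unchanged, and the function names and numerical values are hypothetical.

```python
import math


def implied_cov(lam, psi, theta):
    """Covariance matrix implied by a one-factor model with four items."""
    return [[lam[i] * lam[k] * psi + (theta[i] if i == k else 0.0)
             for k in range(4)] for i in range(4)]


def standardized_to_anchored(lam_bar, theta_bar):
    """Map (lambda_bar, theta_bar) with psi = 1 to (lambda*, psi*, theta*) with lambda*_1 = 1."""
    if lam_bar[0] == 0:
        raise ValueError("the first loading must be nonzero to anchor on item 1")
    lam_star = [l / lam_bar[0] for l in lam_bar]
    return lam_star, lam_bar[0] ** 2, list(theta_bar)


def anchored_to_standardized(lam_star, psi_star, theta_star, sign=1.0):
    """Inverse map; `sign` resolves the sign indeterminacy of the loadings."""
    root = sign * math.sqrt(psi_star)
    return [l * root for l in lam_star], list(theta_star)


# Hypothetical standardized-model parameters.
lam_bar = [2.0, 1.0, -0.5, 1.5]
theta_bar = [1.0, 0.5, 0.8, 1.2]

lam_star, psi_star, theta_star = standardized_to_anchored(lam_bar, theta_bar)
cov_standardized = implied_cov(lam_bar, 1.0, theta_bar)
cov_anchored = implied_cov(lam_star, psi_star, theta_star)
same_moments = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
                   for row_a, row_b in zip(cov_standardized, cov_anchored)
                   for a, b in zip(row_a, row_b))
lam_back, _ = anchored_to_standardized(lam_star, psi_star, theta_star, sign=1.0)
print(same_moments, lam_back)
```

Both parameter sets imply the same covariance matrix, and mapping back with a positive sign recovers the original loadings because $\bar{\lambda}_1 > 0$ in this example.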

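Note that both models are nested in the nonrestricted model. Hence, we can apply Lemma 2 to investigate whether the models are locally equivalent. Substituting the restrictions $\lambda_1 = 1$ (anchoring) and $\psi = 1$ (factor standardization) in the Jacobian for the nonidentified model given in (5.2) produces

$$
J(\vartheta_{12})|_{\vartheta_{12} = \tilde{\vartheta}_{12}} =
\begin{bmatrix}
2 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
\lambda_2 & 1 & 0 & 0 & \lambda_2 & 0 & 0 & 0 & 0 \\
\lambda_3 & 0 & 1 & 0 & \lambda_3 & 0 & 0 & 0 & 0 \\
\lambda_4 & 0 & 0 & 1 & \lambda_4 & 0 & 0 & 0 & 0 \\
0 & 2\lambda_2 & 0 & 0 & \lambda_2^2 & 0 & 1 & 0 & 0 \\
0 & \lambda_3 & \lambda_2 & 0 & \lambda_3\lambda_2 & 0 & 0 & 0 & 0 \\
0 & \lambda_4 & 0 & \lambda_2 & \lambda_4\lambda_2 & 0 & 0 & 0 & 0 \\
0 & 0 & 2\lambda_3 & 0 & \lambda_3^2 & 0 & 0 & 1 & 0 \\
0 & 0 & \lambda_4 & \lambda_3 & \lambda_4\lambda_3 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2\lambda_4 & \lambda_4^2 & 0 & 0 & 0 & 1
\end{bmatrix}.
$$

The rank of this Jacobian, Rank$[J(\vartheta_{12})|_{\vartheta_{12} = \tilde{\vartheta}_{12}}]$, is 8, which is one less than the number of parameters, so the models are locally equivalent under the assumptions stated above.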

Although equivalent parametrizations, anchoring is often regarded as preferable to standardization since it ensures factorial invariance (see Section 3.3.2).


BPN model and restricted dichotomous one-factor model 

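We now investigate whether the identified BPN model presented on page 144 is equivalent to the restricted version of the one-factor model with four dichotomous indicators introduced on page 142. In this case we are not aware of any common underidentified model in which the two models are nested, so use of Lemma 2 is precluded.

We therefore proceed by substituting the restrictions $\lambda_1 = \lambda_2 = \lambda_3 = 1$ into (5.3), obtaining the reduced form parameters

$$
\begin{aligned}
m(\vartheta) = \Bigg[ &
\frac{\beta_1}{\sqrt{\psi + 1}},\;
\frac{\beta_2}{\sqrt{\psi + 1}},\;
\frac{\beta_3}{\sqrt{\psi + 1}},\;
\frac{\beta_4}{\sqrt{\lambda_4^2\psi + 1}},\;
\frac{\psi}{\psi + 1},\;
\frac{\psi}{\psi + 1},\;
\frac{\lambda_4\psi}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\psi + 1}}, \\
& \frac{\psi}{\psi + 1},\;
\frac{\lambda_4\psi}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\psi + 1}},\;
\frac{\lambda_4\psi}{\sqrt{\lambda_4^2\psi + 1}\sqrt{\psi + 1}}
\Bigg]'.
\end{aligned}
$$

Equating these with (5.4), but letting the intercepts now be denoted $b_i$ to distinguish the parameters, we solve for the parameters of the restricted one-factor model in terms of the BPN parameters and obtain:

$$
\beta_1 = \frac{b_1}{\sqrt{2 - \rho_1}},\quad
\beta_2 = \frac{b_2}{\sqrt{2 - \rho_1}},\quad
\beta_3 = \frac{b_3}{\sqrt{2 - \rho_1}},\quad
\beta_4 = b_4\sqrt{\frac{\rho_1}{2\rho_1 - \rho_2^2}},
$$

$$
\lambda_4 = \pm\sqrt{\frac{\rho_2^2\,(2 - \rho_1)}{\rho_1\,(2\rho_1 - \rho_2^2)}},\quad
\psi = \frac{\rho_1}{2 - \rho_1}.
$$

The solutions are unique apart from the sign of $\lambda_4$ and the BPN model can apparently generate parameters throughout the parameter space of the restricted one-factor model.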

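Solving for the BPN parameters in terms of the parameters of the restricted one-factor model, we get

$$
b_i = \frac{\sqrt{2}\,\beta_i}{\sqrt{1 + \psi}},\; i = 1, 2, 3, \qquad
b_4 = \frac{\sqrt{2}\,\beta_4}{\sqrt{1 + \psi\lambda_4^2}},
$$

$$
\rho_1 = \frac{2\psi}{1 + \psi}, \qquad
\rho_2 = \frac{2\psi\lambda_4}{\sqrt{1 + \psi}\sqrt{1 + \psi\lambda_4^2}}.
$$

The factor variance $\psi$ is obviously nonnegative, $\psi \ge 0$. From $\rho_1 = 2\psi/(1 + \psi)$ it follows that $\rho_1 \ge 0$; the restricted one-factor model cannot generate negative $\rho_1$ for the BPN model. The BPN model and the restricted one-factor model are thus not globally equivalent.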
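The asymmetry between the two directions of the transformation can be illustrated numerically. This is a sketch under assumptions, not the authors' code: it takes $\rho_1 = 2\psi/(1 + \psi)$ from the text, assumes the analogous expressions $b_i = \sqrt{2}\beta_i/\sqrt{1 + \psi}$ for $i = 1, 2, 3$, $b_4 = \sqrt{2}\beta_4/\sqrt{1 + \psi\lambda_4^2}$ and $\rho_2 = 2\psi\lambda_4/(\sqrt{1 + \psi}\sqrt{1 + \psi\lambda_4^2})$, and uses hypothetical function names and parameter values.

```python
import math


def one_factor_to_bpn(beta, lam4, psi):
    """Map restricted one-factor parameters to BPN parameters (b, rho1, rho2)."""
    s = math.sqrt(1.0 + psi)
    s4 = math.sqrt(1.0 + psi * lam4 ** 2)
    b = [math.sqrt(2.0) * beta[i] / s for i in range(3)]
    b.append(math.sqrt(2.0) * beta[3] / s4)
    rho1 = 2.0 * psi / (1.0 + psi)
    rho2 = 2.0 * psi * lam4 / (s * s4)
    return b, rho1, rho2


def psi_from_rho1(rho1):
    """Invert rho1 = 2 psi / (1 + psi); None if no nonnegative psi exists."""
    if rho1 < 0.0 or rho1 >= 2.0:
        return None
    return rho1 / (2.0 - rho1)


# Hypothetical one-factor parameters.
b, rho1, rho2 = one_factor_to_bpn([0.2, -0.1, 0.3, 0.4], lam4=-0.8, psi=0.5)
print(rho1, rho2, psi_from_rho1(rho1))

# No nonnegative factor variance reproduces a negative rho1.
print(psi_from_rho1(-0.2))
```

Every admissible $\psi$ yields $\rho_1 \ge 0$, whereas a negative $\rho_1$ in the BPN model has no counterpart in the restricted one-factor model, in line with the lack of global equivalence noted above.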

Importantly, as long as the restriction $\rho_1 \ge 0$ is reasonable we can make use of the equivalence to greatly simplify estimation of the BPN model. Instead of having to integrate over four random effects to obtain the marginal likelihood, we need only evaluate a one-dimensional integral (Rabe-Hesketh and Skrondal, 2001).

Another and perhaps more prominent example of lack of global equivalence concerns a multivariate linear model with a compound symmetric residual covariance matrix and a random intercept model. The former specifies variances equal to $A$ and covariances equal to $B$. For the random intercept model the implied marginal variances and covariances become $\psi + \theta_{ii}$ and $\psi$, respectively. Note that the covariances are necessarily non-negative in the random intercept model since $\psi \ge 0$, whereas the covariances $B$ need not be positive under compound symmetry (see also Lindsey, 1999).
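The compound symmetry example reduces to a two-line calculation. The following sketch assumes a common error variance $\theta$ across responses, so that $A = \psi + \theta$ and $B = \psi$; the function name and the numerical values are illustrative.

```python
def random_intercept_from_compound_symmetry(variance, covariance):
    """Return (psi, theta) reproducing a compound symmetric matrix with the given
    common variance A and covariance B, or None if no random intercept model
    with psi >= 0 and theta >= 0 does so."""
    psi = covariance
    theta = variance - covariance
    if psi < 0.0 or theta < 0.0:
        return None
    return psi, theta


print(random_intercept_from_compound_symmetry(2.0, 0.5))   # positive covariance
print(random_intercept_from_compound_symmetry(2.0, -0.3))  # negative covariance
```

A positive covariance maps to a valid random intercept variance, while a negative covariance, which compound symmetry permits, has no random intercept representation.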


5.3.3 Empirical equivalence 

Analogously to the case for identification, we now consider ‘empirical’ investigation of equivalence:

• Two models are empirically equivalent for a sample if there are one-to-one functions relating the parameter estimates across models such that almost identical likelihoods are produced.

As for empirical identification, investigation of empirical equivalence is less formal than ‘theoretical’ equivalence since it is based on parameter estimates. However, we must in practice resort to empirical equivalence for models that are not completely characterized by, say, their first and second order moments, since this renders the theoretical approach infeasible.


BLN and restricted dichotomous one-factor models 


We now consider empirical equivalence between the identified BLN model and the restricted one-factor logit model. The deviances of these models were 6.28 and 6.58, respectively. The deviances are so close since, as expected, the implied first and second order moments are nearly identical for both models: The means are respectively estimated as $\hat{\mu}_1 = -0.91, -0.87$; $\hat{\mu}_2 = -0.99, -0.96$; $\hat{\mu}_3 = -1.05, -1.03$; $\hat{\mu}_4 = -1.03, -0.97$; and the correlations of the latent responses are estimated as $\hat{\rho}_{12} = \hat{\rho}_{13} = \hat{\rho}_{23} = 0.36, 0.34$ and $\hat{\rho}_{14} = \hat{\rho}_{24} = \hat{\rho}_{34} = -0.21, -0.21$.

We can transform the parameter estimates of the one-factor model into estimates for the BLN model using the equations

$$
b_r = \beta_r \frac{\sqrt{\sigma^2 + \pi^2/3}}{\sqrt{\psi + \pi^2/3}}, \quad r = 1, 2, 3,
\qquad
b_4 = \beta_4 \frac{\sqrt{\sigma^2 + \pi^2/3}}{\sqrt{\lambda_4^2\psi + \pi^2/3}},
$$

$$
\rho_1 = \frac{\psi(\sigma^2 + \pi^2/3)}{\sigma^2(\psi + \pi^2/3)},
\qquad
\rho_2 = \frac{\lambda_4\psi(\sigma^2 + \pi^2/3)}{\sigma^2\sqrt{(\psi + \pi^2/3)(\lambda_4^2\psi + \pi^2/3)}},
$$

where we substitute the maximum likelihood estimate of 4.06 for $\sigma$. The resulting estimates (-3.87, -4.27, -4.55, -4.30, 0.41, -0.25) are close to the maximum likelihood estimates for the BLN model (-4.04, -4.42, -4.69, -4.56, 0.43, -0.25).
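The transformation can be expressed as a small function. This is a sketch under assumptions: it takes the transformation to be $b_r = \beta_r\sqrt{\sigma^2 + \pi^2/3}/\sqrt{\psi + \pi^2/3}$ for $r = 1, 2, 3$, with $\lambda_4^2\psi$ replacing $\psi$ for the fourth item, and the correlations $\rho_1$, $\rho_2$ rescaled by $(\sigma^2 + \pi^2/3)/\sigma^2$. The function name and the input values are hypothetical (the one-factor estimates themselves are not reproduced here), so the example only checks an internal consistency property.

```python
import math

LOGISTIC_VARIANCE = math.pi ** 2 / 3  # variance of the standard logistic distribution


def one_factor_logit_to_bln(beta, lam4, psi, sigma):
    """Transform restricted one-factor logit parameters into BLN parameters."""
    total = sigma ** 2 + LOGISTIC_VARIANCE
    v = psi + LOGISTIC_VARIANCE
    v4 = lam4 ** 2 * psi + LOGISTIC_VARIANCE
    b = [beta[r] * math.sqrt(total / v) for r in range(3)]
    b.append(beta[3] * math.sqrt(total / v4))
    rho1 = psi * total / (sigma ** 2 * v)
    rho2 = lam4 * psi * total / (sigma ** 2 * math.sqrt(v * v4))
    return b, rho1, rho2


# Consistency check with hypothetical values: if psi = sigma^2 and lambda_4 = 1,
# the two parametrizations coincide, so b = beta and rho1 = rho2 = 1.
beta = [-1.0, -1.2, -1.3, -1.1]
b, rho1, rho2 = one_factor_logit_to_bln(beta, lam4=1.0, psi=4.0, sigma=2.0)
print(b, rho1, rho2)
```

Replacing $\pi^2/3$ by 1 and setting $\sigma = 1$ gives the corresponding probit expressions, for example $\rho_1 = 2\psi/(1 + \psi)$.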


5.4 Summary and further reading 

We have defined identification and equivalence and shown how both properties 
can be investigated analytically as well as empirically. Jacobians, their rank 
and a basis for their nullspace are easily obtained using computer algebra. For 
simplicity, we have imposed parameter restrictions by direct substitution in 
the Jacobians. Alternatively, we could have augmented the Jacobian used in 
this chapter with a Jacobian of a restriction matrix (e.g. Rothenberg, 1971). 

Our discussion has been confined to identification of parametric models. A more daunting task is ‘nonparametric’ identification, where the models are characterized by constraints on functions (generally not parameterized). Identification then concerns whether there is more than one set of functions generating the same distribution for the observations. Nonparametric identification for proportional hazards models with latent variables has been extensively studied in econometrics (e.g. Brinch, 2001; Van den Berg, 2001).

We have only considered identification for models where the fundamental parameters $\vartheta$ are unknown constants. In the Bayesian setting, Chechile (1977) discusses the notion of ‘posterior-probabilistically identified’ and demonstrates that models may be identified in this sense although they are not ‘likelihood-identified’. An example where identification is achieved through the prior distributions is discussed in Knorr-Held and Best (2001).

It is worth noting that the use of Markov Chain Monte Carlo (MCMC) methods or other simulation methods (see Chapter 6) may be dangerous from the point of view of identification. As stated by Keane (1992, p. 193), this is because “simulation error will generate contours where the true objective function is flat and will generate a nonsingular Hessian when the true Hessian is singular”. Keane illustrated the danger by referring to Horowitz et al. (1982) who did not discover that a particular multinomial probit model was nonidentified.

Equivalence is not only an issue for the latent variable models discussed in this book. For instance, MacCallum et al. (1993) point out that the multidimensional scaling models for three-way proximity data suggested by Tucker (1972) and Carroll and Chang (1972) are equivalent, although they represent widely different representations of individual differences in judgment tasks. Equivalence may also involve different types of models, for instance latent class and Rasch models (e.g. Lindsay et al., 1991; Heinen, 1996) or factor models for continuous responses and latent profile models (e.g. Bartholomew, 1987, 1993; Molenaar and von Eye, 1994).

A modern and comprehensive discussion of identification and equivalence for parametric models, including formal definitions, assumptions and theorems, is provided by Bekker et al. (1994). Useful treatments of identification include Koopmans and Reiersøl (1950), Wald (1950), Anderson and Rubin (1956), Fisher (1966), Geraci (1976), Rothenberg (1971), Dupačová and Wold (1982), Hsiao (1983), Rabe-Hesketh and Skrondal (2001) and Bechger et al. (2001). Contributions to the equivalence literature include Stelzl (1986), Breckler (1990), Jöreskog and Sörbom (1990), Luijben (1991), MacCallum et al. (1993), Hershberger (1994), Raykov and Penev (1999), Rabe-Hesketh and Skrondal (2001) and Bechger et al. (2002).

The identification problem has been given ample attention in econometrics 
where complex structural models have been used for a long time. This stands 
in contrast to biometrics, where much simpler models have traditionally been 
used. The equivalence problem appears to have attracted most interest in 
psychometrics. However, identification and equivalence should definitely be 
given more attention throughout statistics due to the increasing popularity of 
highly structured models. 
CHAPTER 6


Estimation 


6.1 Introduction 

In this chapter we describe a number of estimation methods that have been 
proposed for latent variable models belonging to the general model framework 
presented in Chapter 4. We believe that a relatively nontechnical overview 
of different methods is useful, since some of the methods are alien outside 
particular methodological disciplines. The estimation methods are sketched in 
more or less detail, referring to the pertinent literature for technical details. 
An incomplete overview of software implementing the different estimation 
methods is provided in an appendix. 

We also consider the strengths and weaknesses of different methods. The estimation methods turn out to be quite heterogeneous according to criteria such as generality of the accommodated model class, robustness, computational efficiency, treatment of missing data and performance of the estimators.

So far in this book we have treated latent variables as random and parameters as fixed, which is the most common approach. Alternatively, both latent variables and parameters can be treated as fixed, either for theoretical reasons or for computational convenience. In contrast, Bayesians treat both latent variables and parameters as random variables. Although often viewed as a fundamentally different statistical paradigm, this approach is currently also often adopted for practical reasons. Recognizing these different perspectives is important for delineating different kinds of estimation methods.

Random latent variables and fixed parameters 

When latent variables are treated as random and parameters as fixed, inference
is usually based on the marginal likelihood, the likelihood of the data given the
latent variables, integrated (or summed in the discrete case) over the latent
variable distribution. In the case of continuous latent variables the likelihood
generally does not have a closed form. In Section 6.3, we will hence describe
several more or less accurate approximate methods of integration, including
numerical and Monte Carlo integration (simulated likelihood). Different
methods for maximizing likelihoods, including the EM and Newton-Raphson
algorithms, are reviewed in Section 6.4.

In Section 6.5 we discuss nonparametric maximum likelihood estimation 
(NPMLE), where we relax the assumption of normal latent variables. The idea 
of restricted maximum likelihood (REML) is briefly described in Section 6.6. 
For some models the dimensionality of integration can be considerably
reduced by using a limited information approach described in Section 6.7.
Section 6.8.3 describes penalized quasi-likelihood (PQL), an approximate method
that avoids integration, and Section 6.9 discusses the algorithmically similar
generalized estimating equations (GEE). GEE is very different from the other
approaches considered in this book since dependence among the responses is
not explicitly modeled using latent variables, but instead treated as a
nuisance. Furthermore, the regression parameters are no longer interpretable as
conditional or cluster-specific effects, but as marginal or population averaged
effects.

Fixed latent variables and parameters 

When latent variables are construed as unknown fixed parameters instead of
random variables, integration is avoided. The fixed effects approach can be
viewed as conditional on the effects in the sample. In this case it is irrelevant
whether the clusters can realistically be considered a random sample from
a population. We describe two fixed effects approaches to estimation in
Section 6.10. In joint maximum likelihood (JML) estimation the latent variables
and model parameters are jointly estimated, whereas the latent variables are
loosely speaking ‘conditioned away’ in conditional maximum likelihood (CML)
estimation.

Random latent variables and parameters 

The Bayesian approach described in Section 6.11 treats both latent variables
and parameters as random and bases inference on their posterior distribution
given the observed data. In Section 6.11.5, we describe the popular Markov
chain Monte Carlo (MCMC) method for sampling from the posterior
distribution and estimating parameters by their posterior means.

6.2 Maximum likelihood: Closed form marginal likelihood 

The integral involved in the marginal likelihood can in some instances be 
explicitly solved and expressed in closed form, the canonical examples being 
the LISREL model and the linear mixed model. In these cases multivariate 
normal latent variables and multivariate normal responses given the latent 
variables produce multivariate normal marginal distributions. 

Estimation of linear mixed models is discussed in Section 6.8.1. For the 
LISREL model introduced in Section 3.5, the model-implied covariance matrix 
was shown to be 


$$\Sigma = \Lambda (I - B)^{-1} \Psi [(I - B)^{-1}]' \Lambda' + \Theta.$$


Since the mean structure is often not of interest in these models we let the
$n$-dimensional response vector $y_j$ have zero expectation in this section. The
likelihood can then be expressed as

$$l(\vartheta; Y) = \prod_{j=1}^{J} (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2} y_j' \Sigma^{-1} y_j\right).$$

The empirical covariance matrix $S$ of $y$ is the sufficient statistic for the
parameters structuring $\Sigma$. Since $S$ has a Wishart distribution, it can be shown (e.g.
Jöreskog, 1967) that instead of maximizing the likelihood we can equivalently
minimize the fitting function

$$F_{ML} = \log|\Sigma| + \mathrm{tr}(S\Sigma^{-1}) - \log|S| - n,$$

with respect to the unknown free parameters $\Lambda$, $\Psi$ and $\Theta$. $F_{ML}$ is
nonnegative and only zero if there is a perfect fit in the sense that the fitted $\widehat{\Sigma}$
equals $S$. The fitting function also provides an estimated information matrix
for the maximum likelihood estimates.
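
As a minimal sketch of how this fitting function behaves, the following Python code evaluates the discrepancy in its standard form, $\log|\Sigma| + \mathrm{tr}(S\Sigma^{-1}) - \log|S| - n$, for the $2 \times 2$ case. The function names `det2` and `f_ml` and the example matrices are illustrative assumptions and are not taken from any LISREL software.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def f_ml(sigma, s):
    """ML fitting function for 2x2 model-implied (sigma) and empirical (s) covariance matrices."""
    d = det2(sigma)
    inv = [[sigma[1][1] / d, -sigma[0][1] / d],
           [-sigma[1][0] / d, sigma[0][0] / d]]
    # Trace of S * Sigma^{-1}
    trace = sum(s[i][k] * inv[k][i] for i in range(2) for k in range(2))
    return math.log(d) + trace - math.log(det2(s)) - 2

s_emp = [[2.0, 0.6], [0.6, 1.0]]        # hypothetical empirical covariance matrix
sigma_bad = [[2.0, 0.0], [0.0, 1.0]]    # hypothetical model forcing a zero covariance
print(f_ml(s_emp, s_emp), f_ml(sigma_bad, s_emp))
```

The first value is zero up to rounding (perfect fit), whereas the misspecified covariance matrix gives a strictly positive discrepancy.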

Browne (1984) suggested a general family of weighted least squares (WLS) 
fit functions for covariance structures, 

$$F_{WLS} = (\sigma - s)' W^{-1} (\sigma - s), \qquad (6.1)$$

where $\sigma$ and $s$ are vectors containing the nonredundant elements in the
model-implied and empirical covariance matrix, respectively, and $W$ is a positive
definite weight matrix. The maximum likelihood estimator is obtained by using $\widehat{\Sigma}$,
the covariance matrix implied by the parameter estimates, as weight matrix.
WLS methods are also useful for limited information estimation of models
without closed form likelihoods, a topic we will discuss in Section 6.7.

Generalized linear random intercept models may also have closed form
likelihoods. Specifically, combining Poisson distributed responses given the mean
$\exp(\nu)$ with a gamma distribution for the mean produces a negative binomial
marginal model (e.g. Greenwood and Yule, 1920; Hausman et al., 1984). For
dichotomous responses it is well known that the beta binomial model, where
probabilities are assumed to be beta distributed, has a closed form likelihood
(e.g. Skellam, 1948; Williams, 1975). Here, a regression model is often specified
for the marginal logits (e.g. Heckman and Willis, 1977). Unlike the negative
binomial model, this is not a generalized linear mixed model since it cannot
be specified by including an additive random intercept in the linear predictor.
Unfortunately, these useful results are not applicable for the common situation
where covariates vary within clusters (e.g. Neuhaus and Jewell, 1990).
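
The Poisson-gamma result can be checked numerically. The sketch below assumes the simplest case of a single count whose Poisson mean has a gamma distribution with given shape and rate (no covariates), and compares the closed form negative binomial probability with brute-force midpoint integration over the mean. The function names, the integration range and the parameter values used in the check are illustrative assumptions.

```python
import math

def negbin_pmf(y, shape, rate):
    """Closed form marginal probability of a Poisson count with Gamma(shape, rate) mean."""
    log_p = (math.lgamma(y + shape) - math.lgamma(shape) - math.lgamma(y + 1)
             + shape * math.log(rate / (rate + 1.0))
             - y * math.log(rate + 1.0))
    return math.exp(log_p)

def poisson_gamma_numeric(y, shape, rate, upper=60.0, steps=60000):
    """Same marginal probability by midpoint integration over the Poisson mean."""
    width = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * width
        log_poisson = y * math.log(lam) - lam - math.lgamma(y + 1)
        log_gamma = (shape * math.log(rate) - math.lgamma(shape)
                     + (shape - 1.0) * math.log(lam) - rate * lam)
        total += math.exp(log_poisson + log_gamma) * width
    return total

print(negbin_pmf(3, 2.0, 1.5), poisson_gamma_numeric(3, 2.0, 1.5))
```

The two numbers agree closely, which is what a closed form marginal likelihood means in practice: no numerical integration is needed.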


6.3 Maximum likelihood: Approximate marginal likelihood 

The reduced form distribution of the responses given the explanatory
variables was derived using latent variable integration in Section 4.9.1. Regarded
as a function of the fundamental parameters $\vartheta$ for given responses, this is the
marginal likelihood $l(\vartheta; y, X)$.

For the general model, the marginal likelihood is

$$l(\vartheta; y, X) = \prod g^{(L)}(y_{(L)}),$$

where the product is over all top-level clusters. Let $g^{(l)}(y_{(l)} \mid \zeta^{(l+1+)})$ be the joint
conditional probability (density) of the responses for a level-$l$ unit, given the
latent variables at levels $l+1$ and above, where $\zeta^{(l+)} = (\zeta^{(l)\prime}, \ldots, \zeta^{(L)\prime})'$ denotes the
latent variables at levels $l$ and above. Starting from
$l = 2$, we can recursively evaluate the integrals

$$g^{(l)}(y_{(l)} \mid \zeta^{(l+1+)}) = \int h^{(l)}(\zeta^{(l)}) \prod g^{(l-1)}(y_{(l-1)} \mid \zeta^{(l+)}) \, d\zeta^{(l)}, \qquad (6.2)$$

up to level $L$. Here we have simplified the notation by setting

$$g^{(l)}(y_{(l)} \mid \zeta^{(l+1+)}) \equiv g^{(l)}(y_{(l)} \mid X_{(l)}, \zeta^{(l+1+)}; \vartheta)$$

and will continue to do so in the remainder of the chapter.

We will describe some integration methods in detail for the two-level random
intercept model and indicate how they are extended for the general model.
Setting $\zeta_j^{(2)} = \zeta_j$, the random intercept model is given by

$$\nu_{ij} = x_{ij}'\beta + \zeta_j.$$

The joint density of the responses for the $j$th level-2 unit is

$$g^{(2)}(y_{j(2)}) = \int_{-\infty}^{\infty} h(\zeta_j) \prod_i g^{(1)}(y_{ij} \mid \zeta_j) \, d\zeta_j. \qquad (6.3)$$

Unfortunately, there are in general no closed forms for the integrals involved. 
There are several approaches to approximating the integrals: 

• Laplace approximation, 

• Numerical integration using quadrature or adaptive quadrature, 

• Monte Carlo integration, 

which are described in Sections 6.3.1 to 6.3.3. Section 6.3.4 describes a
tailor-made simulation approach for multivariate normal latent responses, based
on latent response integration instead of latent variable integration (see
Section 4.9).

6.3.1 Laplace approximation 

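For a unidimensional integral, the Laplace approximation can be written as

$$\begin{aligned}
\int_{-\infty}^{\infty} \exp[f(x)] \, dx &\approx \int_{-\infty}^{\infty} \exp[f(\tilde{x}) - (x - \tilde{x})^2/(2\sigma^2)] \, dx \\
&= \exp[f(\tilde{x})] \sqrt{2\pi}\,\sigma \int_{-\infty}^{\infty} \phi(x; \tilde{x}, \sigma^2) \, dx \\
&= \exp[f(\tilde{x})] \sqrt{2\pi}\,\sigma, \qquad (6.4)
\end{aligned}$$

where $\phi(x; \tilde{x}, \sigma^2)$ is a normal density with mean $\tilde{x}$ and variance $\sigma^2$, $\tilde{x}$ is the
mode of $f(x)$ and hence of $\exp[f(x)]$ and

$$\sigma^2 = -\left[ \left. \frac{\partial^2 f(x)}{\partial x^2} \right|_{x = \tilde{x}} \right]^{-1},$$

minus the inverse of the second derivative of $f(x)$ with respect to $x$, evaluated
at the mode $\tilde{x}$.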

The approximation is derived by expanding $f(x)$ as a second order Taylor
series around its mode so that the first order term vanishes (see inside
brackets in the first line of (6.4)). The approximation is exact if the integrand is
proportional to a normal density with mean $\tilde{x}$ and variance $\sigma^2$ since $f(x)$ is
in this case quadratic in $x$.
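
The following Python sketch illustrates (6.4) under stated assumptions. The helper `laplace_integral` takes the function $f$, its second derivative and its mode as inputs (in practice the mode would be found numerically). The first example uses a quadratic $f$, for which the approximation is exact. The second example, $f(x) = kx - k\exp(x)$, is a hypothetical test function chosen because its mode is 0 and the exact integral is $\Gamma(k)/k^k$; applying (6.4) to it reproduces Stirling's formula, and the relative error shrinks as $k$ grows.

```python
import math

def laplace_integral(f, second_derivative, mode):
    """Laplace approximation to the integral of exp(f(x)) over the real line."""
    sigma2 = -1.0 / second_derivative(mode)   # minus the inverse curvature at the mode
    return math.exp(f(mode)) * math.sqrt(2.0 * math.pi * sigma2)

# Exact case: exp(f) is proportional to a normal density with mean 1 and variance 0.25.
normal_kernel = laplace_integral(lambda x: -(x - 1.0) ** 2 / (2.0 * 0.25),
                                 lambda x: -1.0 / 0.25, mode=1.0)

def gamma_type_example(k):
    """Return (approximation, exact value) for f(x) = k*x - k*exp(x)."""
    approx = laplace_integral(lambda x: k * x - k * math.exp(x),
                              lambda x: -k * math.exp(x), mode=0.0)
    exact = math.gamma(k) / k ** k
    return approx, exact

for k in (2.0, 10.0, 100.0):
    approx, exact = gamma_type_example(k)
    print(k, abs(approx - exact) / exact)
```

Here $k$ plays a role loosely analogous to the cluster size: $f$ is a sum of $k$ identical terms, and the integrand becomes closer to a normal density as $k$ increases.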

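For a random intercept model, we need to evaluate the integral in (6.3).
The integrand (corresponding to $\exp[f(x)]$),

$$h(\zeta_j) \prod_i g^{(1)}(y_{ij} \mid \zeta_j), \qquad (6.5)$$

is the product of the ‘prior’ density of $\zeta_j$ and the joint probability (density) of
the responses given $\zeta_j$. After normalization with respect to $\zeta_j$, this integrand
is therefore just the ‘posterior’ density of $\zeta_j$ given the observed responses for
cluster $j$ (see also Section 7.2). In the Laplace approximation, $\tilde{x}$ therefore
corresponds to the posterior mode $\tilde{\zeta}_j$ and $\sigma$ corresponds to the curvature of
the posterior at the mode, $\tilde{\sigma}_j$. The approximation becomes

$$\begin{aligned}
\ln g^{(2)}(y_{j(2)}) &\approx \ln(\sqrt{2\pi}\,\tilde{\sigma}_j) + \ln h(\tilde{\zeta}_j) + \sum_i \ln g^{(1)}(y_{ij} \mid \tilde{\zeta}_j) \\
&= \ln(\tilde{\sigma}_j/\sqrt{\psi}) - \tilde{\zeta}_j^2/(2\psi) + \sum_i \ln g^{(1)}(y_{ij} \mid \tilde{\zeta}_j). \qquad (6.6)
\end{aligned}$$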

This approximation is good whenever the posterior density of the random
intercept is approximately normal. It is well known that this is the case for
large sample sizes (cluster sizes in this setting). This asymptotic normality
is sometimes referred to as a Bayesian central limit theorem (e.g. Carlin and
Louis, 1998). The posterior also becomes more normal as the conditional
response probabilities become more normal, e.g. Poisson with large mean or
binomial with large denominator. In this case the posterior mode approaches
the posterior mean and $\tilde{\sigma}_j$ the posterior standard deviation.

In penalized quasi-likelihood (PQL) methods (e.g. Schall, 1991; McGilchrist,
1994; Breslow and Clayton, 1993), the first term in (6.6) is ignored and the
remaining terms are maximized with respect to the fixed effects parameters
$\beta$ (for known variance parameters). It is important to note that this does not
correspond to maximum likelihood. Instead, the penalized quasi log-likelihood

$$-\zeta_j^2/(2\psi) + \sum_i \ln g^{(1)}(y_{ij} \mid \zeta_j)$$

is jointly maximized. This is accomplished by maximization with respect to $\beta$
and then with respect to $\zeta_j$ for given $\beta$, since $\tilde{\zeta}_j$ maximizes (6.5) and the log
of (6.5) differs from the above only by the constant $\ln(\sqrt{2\pi\psi})$. An alternative
derivation of the PQL approach is discussed in Section 6.8.3.

Lee and Nelder (1996, 2001) define the hierarchical likelihood, or $h$-likelihood,
as the joint distribution of the responses and latent variables treating the
latent variables as observed. The log of the $h$-likelihood is therefore

$$\ell_h = -\zeta_j^2/(2\psi) + \sum_i \ln g^{(1)}(y_{ij} \mid \zeta_j) - \ln(\sqrt{2\pi\psi}). \qquad (6.7)$$

Maximizing the $h$-likelihood with respect to $\beta$ and $\zeta_j$ (for fixed $\psi$) leads to the
same estimates as penalized quasi-likelihood. The merits of the $h$-likelihood
are that it does not require integration and allows flexible specification of
latent variable distributions. However, in the context of missing data
problems, Little and Rubin (1983; 2002, p. 124) argue that the approach does not
“generally share the optimal properties of ML estimation except under trivial
asymptotics in which the proportion of missing data goes to zero as the sample
size increases”. In latent variable models, the missing data are the realizations
of the latent variables so that the proportion of missing data goes to zero only
if the cluster sizes go to infinity; see also Clayton (1996a). A useful discussion
follows Lee and Nelder (1996).

For linear mixed models, the likelihood equations for the $h$-likelihood are
the famous ‘mixed model equations’ for $\beta$ and $\zeta$ (Henderson, 1975; Harville,
1977). For given random effects covariance matrix $\Psi$, the estimator for $\beta$ is
the maximum marginal likelihood estimator and the estimator for $\zeta$ is the
empirical Bayes predictor or best linear unbiased predictor (BLUP) discussed
in Section 7.3.1.

Stiratelli et al. (1984) derive the same estimating equations for random
effects logistic regression with dichotomous responses by maximizing the
posterior distribution for $\beta$ and $\zeta_j$ under a diffuse prior for $\beta$ (so that the posterior
is essentially the $h$-likelihood).

The Laplace approximation is based on a second order Taylor series
expansion of $f(x)$. Fourth order Laplace approximations are more accurate if
the posterior density is not normal and have been used to correct the bias
of parameter estimates obtained using second order Laplace (Breslow and
Lin, 1995; Lin and Breslow, 1996). Approximate maximum likelihood using
a sixth order Laplace approximation, known as Laplace6, was proposed by
Raudenbush et al. (2000).

In small simulation studies of dichotomous responses, the sixth order Laplace
approximation performed considerably better than PQL, somewhat better
than 20-point Gauss-Hermite quadrature (Raudenbush and Yang, 1998;
Raudenbush et al., 2000) and similarly to 7-point adaptive quadrature
(Raudenbush et al., 2000). However, Laplace6 (as implemented in HLM) was
considerably faster than adaptive quadrature (as implemented in SAS NLMIXED).
As far as we are aware, this method has so far only been used for generalized
linear mixed models with nested random effects.
6.3.2 Numerical integration

Gauss-Hermite quadrature


Quadrature approximates an integral by a weighted sum of the integrand
evaluated at a set of values or locations of the variable being integrated out.
The locations and weights are referred to as quadrature rules. Gauss-Hermite
quadrature rules are designed to evaluate integrals of the form

$$\int_{-\infty}^{\infty} \exp(-x^2) f(x) \, dx \approx \sum_{r=1}^{R} p_r^* f(a_r^*)$$

exactly with $R$ points if $f(x)$ is a $(2R - 1)$th degree polynomial. Since the
‘weight function’ $\exp(-x^2)$ is proportional to a normal density, we can use the
rule to integrate out the normally distributed latent variable in (6.3). We first
change the variable of integration to a standard normal variable $v_j = \zeta_j/\sqrt{\psi}$
so that the integral in (6.3) becomes

$$g^{(2)}(y_{j(2)}) = \int_{-\infty}^{\infty} \phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j) \, dv_j, \qquad (6.8)$$

where $\phi(\cdot)$ is the standard normal density

$$\phi(v_j) = \frac{1}{\sqrt{2\pi}} \exp(-v_j^2/2).$$

Applying the Gauss-Hermite quadrature rule to this integral gives

$$\int_{-\infty}^{\infty} \phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j) \, dv_j \approx \sum_{r=1}^{R} p_r \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, a_r), \qquad (6.9)$$

where $p_r = p_r^*/\sqrt{\pi}$ and $a_r = \sqrt{2}\, a_r^*$.
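
The quadrature sum in (6.9) can be sketched in a few lines of Python. The example below assumes a random intercept logit model, so that each conditional response probability is a logistic function of the linear predictor, and it hard-codes the standard 5-point Gauss-Hermite rule rather than computing it. The function name `cluster_likelihood_gh` and its arguments (`y` for the responses in one cluster, `xb` for the fixed part of the linear predictor, `psi` for the random intercept variance) are illustrative assumptions.

```python
import math

# Standard 5-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = [-2.0201828704560856, -0.9585724646138185, 0.0,
            0.9585724646138185, 2.0201828704560856]
GH_WEIGHTS = [0.01995324205904591, 0.3936193231522412, 0.9453087204829419,
              0.3936193231522412, 0.01995324205904591]

def cluster_likelihood_gh(y, xb, psi):
    """Quadrature approximation to the likelihood contribution of one cluster."""
    total = 0.0
    for a_star, p_star in zip(GH_NODES, GH_WEIGHTS):
        a_r = math.sqrt(2.0) * a_star         # location on the standard normal scale
        p_r = p_star / math.sqrt(math.pi)     # rescaled weight; the p_r sum to one
        zeta = math.sqrt(psi) * a_r           # random intercept at this location
        prod = 1.0
        for y_ij, xb_ij in zip(y, xb):
            prob = 1.0 / (1.0 + math.exp(-(xb_ij + zeta)))
            prod *= prob if y_ij == 1 else 1.0 - prob
        total += p_r * prod
    return total

print(cluster_likelihood_gh([1, 0, 1], [0.3, -0.2, 0.1], psi=1.5))
```

Because the rescaled weights sum to one, the approximate probabilities of all possible response patterns for a cluster also sum to one, whatever the number of quadrature points.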

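The multivariate integrals in (6.2) required for the general model can be
evaluated using cartesian product quadrature. Here we change the variables of
integration to independent standard normally distributed latent variables $v^{(l)}$
so that

$$\zeta^{(l)} = Q^{(l)} v^{(l)}, \qquad (6.10)$$

where $Q^{(l)}$ is the Cholesky decomposition of the covariance matrix $\Psi^{(l)}$ of
$\zeta^{(l)}$. The integrals over the $M_l$ latent variables at level $l$ then become

$$\begin{aligned}
g^{(l)}(y_{(l)} \mid v^{(l+1+)}) &= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \phi(v_1^{(l)}) \cdots \phi(v_{M_l}^{(l)}) \prod g^{(l-1)}(y_{(l-1)} \mid v^{(l+)}) \, dv_{M_l}^{(l)} \cdots dv_1^{(l)} \\
&\approx \sum_{r_1=1}^{R} p_{r_1} \cdots \sum_{r_{M_l}=1}^{R} p_{r_{M_l}} \prod g^{(l-1)}(y_{(l-1)} \mid a_{r_1}, \ldots, a_{r_{M_l}}, v^{(l+1+)}),
\end{aligned}$$

where $v^{(l+)} = (v^{(l)\prime}, \ldots, v^{(L)\prime})'$.

The latent variables are therefore evaluated at a rectangular grid of points
as shown in the top left panel of Figure 6.1. A different number of
quadrature points $R_m$ can be used for each latent variable $v_m$ requiring a total of
$\prod_{m=1}^{M_l} R_m$ evaluations of the integrand.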
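
A cartesian product rule is simply the set of all combinations of the univariate locations, with the product of the corresponding weights. The Python sketch below builds such a grid for two independent standard normal latent variables, using the 3-point and 2-point rules already rescaled to the standard normal scale. The function name `cartesian_grid` and the choice of rules are illustrative assumptions.

```python
import itertools
import math

def cartesian_grid(rules):
    """rules holds one (locations, weights) pair per latent variable.
    Returns a list of (point, weight) pairs covering the rectangular grid."""
    grid = []
    per_dimension = [list(zip(locations, weights)) for locations, weights in rules]
    for combo in itertools.product(*per_dimension):
        point = tuple(a for a, _ in combo)
        weight = math.prod(p for _, p in combo)
        grid.append((point, weight))
    return grid

# Rules for a standard normal variable (a_r and p_r, not the starred values).
RULE_3 = ([-math.sqrt(3.0), 0.0, math.sqrt(3.0)], [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0])
RULE_2 = ([-1.0, 1.0], [0.5, 0.5])

grid = cartesian_grid([RULE_3, RULE_2])
print(len(grid))
```

The number of grid points is the product of the numbers of points per dimension, which is why cartesian product quadrature becomes expensive as the number of latent variables grows.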

An alternative to cartesian product quadrature rules is spherical quadrature
rules which are specifically designed for integrating over multivariate normal
densities (Stroud, 1971). As the name suggests, the quadrature points lie on
(hyper)spheres (circles in two dimensions) instead of rectangles as shown in
the bottom left panel of Figure 6.1. Importantly, spherical rules require fewer
points than cartesian rules to obtain a given precision. However, for some
dimensionalities and required levels of precision, no spherical rules are currently
available. This is perhaps the reason why spherical rules have not been used
much for latent variable models, exceptions being Clarkson and Zhan (2002)
and Rabe-Hesketh et al. (2004b).

Gaussian quadrature was used for probit item response (IRT) models within
a Fisher scoring algorithm (see Section 6.4.2) by Bock and Lieberman (1970)
and within an EM algorithm (see Section 6.4.1) by Bock and Aitkin (1981).
Butler and Moffitt (1982) suggested quadrature for random intercept probit
regression models. Gaussian quadrature has also been used for models with
other links, for instance the one-parameter logistic IRT model (Thissen, 1982),
and generalized linear mixed models (e.g. Hedeker and Gibbons, 1994, 1996a).



Figure 6.1 Locations for nonadaptive and adaptive integration using cartesian and
spherical product rules, where $\mu_1 = 1$, $\mu_2 = 2$, $\tau_1 = \tau_2 = 1$ and the posterior
correlation is 0.5 (Source: Rabe-Hesketh et al., 2004b)

Gaussian quadrature works well if the product in equation (6.9) is well
approximated by a low degree polynomial. However, in practice a large
number of quadrature points is often required to approximate the likelihood (e.g.
Crouch and Spiegelman, 1990). This will be the case for instance if there are
a large number of level-1 units in a cluster $j$. The product in (6.9) can then
have a very sharp peak and be poorly approximated by a polynomial. If
insufficient quadrature points are used, the peak could be located between adjacent
quadrature points $a_r$ and $a_{r+1}$ so that a substantial part of the likelihood
contribution of cluster $j$ is lost (see upper panel of Figure 6.2). These problems
have been pointed out by Lesaffre and Spiessens (2001) for dichotomous
responses and Albert and Follmann (2000) for counts. Note that the quadrature
approximation can fail even for small cluster sizes for counts, since the
individual $g^{(1)}(y_{ij} \mid v_j)$ can have sharp peaks. The approximation can also be poor
for large random effects variances. Small simulation studies show that
Gauss-Hermite quadrature performs better than PQL (Raudenbush and Yang, 1998;
Raudenbush et al., 2000), but worse than adaptive quadrature (Rabe-Hesketh
et al., 2004b).

Adaptive quadrature 

To overcome the problems with ordinary quadrature, adaptive quadrature
essentially shifts and scales the quadrature locations to place them under the
peak of the integrand. As discussed in Section 6.3.1, the integrand

$$\phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j)$$

is proportional to the posterior density and can often be well approximated by
a normal density $\phi(v_j; \mu_j, \tau_j^2)$ with some cluster-specific mean $\mu_j$ and variance
$\tau_j^2$. Instead of treating the prior density as the ‘weight function’ when applying
a quadrature rule as in (6.9), we therefore rewrite the integral as

$$g^{(2)}(y_{j(2)}) = \int_{-\infty}^{\infty} \phi(v_j; \mu_j, \tau_j^2) \left[ \frac{\phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j)}{\phi(v_j; \mu_j, \tau_j^2)} \right] dv_j, \qquad (6.12)$$

and treat the normal density approximating the posterior density as the weight
function.

Changing the variable of integration from $v_j$ to $z_j = (v_j - \mu_j)/\tau_j$ and
applying the standard quadrature rule yields

$$g^{(2)}(y_{j(2)}) \approx \sum_{r=1}^{R} \pi_{jr} \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, \alpha_{jr}),$$

where

$$\alpha_{jr} \equiv \tau_j a_r + \mu_j,$$

and

$$\pi_{jr} \equiv \sqrt{2\pi}\, \tau_j \exp(a_r^2/2)\, \phi(\tau_j a_r + \mu_j)\, p_r.$$

The term in square brackets will be well approximated by a low-degree
polynomial if the posterior density is approximately normal so that the
numerator is approximately proportional to the denominator. We would
therefore expect the method to require fewer quadrature points than nonadaptive
quadrature and to work well with large cluster sizes. The superiority of
adaptive quadrature can be seen in Figure 6.2 which illustrates for $R = 5$ how
adaptive quadrature translates and scales the locations so that they lie
directly under the integrand.

When there are several latent variables, the posterior covariances must also
be taken into account; see Naylor and Smith (1988) and Rabe-Hesketh et
al. (2004b) for details. For two latent variables with a posterior correlation of
0.5, the second column of Figure 6.1 shows how adaptive quadrature
transforms the locations to fit more closely the elliptical contours of the
(approximately) bivariate normal posterior.

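Naylor and Smith (1982) take the mean $\mu_j$ and variance $\tau_j^2$ of the normal
density approximating the posterior density to be the posterior mean and
variance. Unfortunately, these posterior moments are not known exactly but
must themselves be obtained using adaptive quadrature. Integration is
therefore iterative. Using starting values $\mu_j^0 = 0$ and $\tau_j^0 = 1$ to define $\alpha_{jr}^0$ and $\pi_{jr}^0$,
the posterior means and variances are updated in the $k$th iteration using

$$\mu_j^k = \sum_r \alpha_{jr}^{k-1} \frac{\pi_{jr}^{k-1} \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, \alpha_{jr}^{k-1})}{g_k^{(2)}(y_{j(2)})},$$

$$(\tau_j^k)^2 = \sum_r (\alpha_{jr}^{k-1})^2 \frac{\pi_{jr}^{k-1} \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, \alpha_{jr}^{k-1})}{g_k^{(2)}(y_{j(2)})} - (\mu_j^k)^2,$$

where $g_k^{(2)}(y_{j(2)}) = \sum_r \pi_{jr}^{k-1} \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, \alpha_{jr}^{k-1})$ is the current approximation
to the integral, and this is repeated until convergence. A similar iterative algorithm is
described in Naylor and Smith (1988).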
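
The iterative scheme can be sketched as follows for a single cluster of a random intercept logit model. This is a minimal illustration under stated assumptions, not the gllamm implementation: the logit link, the hard-coded 5-point rule, the function names and the convergence tolerance are all assumptions made for the example. With the starting values 0 and 1 the first pass coincides with ordinary quadrature.

```python
import math

# Standard 5-point Gauss-Hermite rule for the weight function exp(-x^2).
GH_NODES = [-2.0201828704560856, -0.9585724646138185, 0.0,
            0.9585724646138185, 2.0201828704560856]
GH_WEIGHTS = [0.01995324205904591, 0.3936193231522412, 0.9453087204829419,
              0.3936193231522412, 0.01995324205904591]
A = [math.sqrt(2.0) * x for x in GH_NODES]          # locations a_r
P = [w / math.sqrt(math.pi) for w in GH_WEIGHTS]    # weights p_r

def std_normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def conditional_product(v, y, xb, psi):
    """Product over level-1 units of the response probabilities given v (logit link)."""
    prod = 1.0
    for y_ij, xb_ij in zip(y, xb):
        prob = 1.0 / (1.0 + math.exp(-(xb_ij + math.sqrt(psi) * v)))
        prod *= prob if y_ij == 1 else 1.0 - prob
    return prod

def ordinary_quadrature(y, xb, psi):
    return sum(p * conditional_product(a, y, xb, psi) for a, p in zip(A, P))

def adaptive_quadrature(y, xb, psi, max_iter=100, tol=1e-10):
    """Return (likelihood contribution, posterior mean, posterior sd) for one cluster."""
    mu, tau = 0.0, 1.0                               # starting values
    like = None
    for _ in range(max_iter):
        alpha = [tau * a + mu for a in A]            # shifted and scaled locations
        pi_w = [math.sqrt(2.0 * math.pi) * tau * math.exp(0.5 * a * a)
                * std_normal_pdf(al) * p for a, al, p in zip(A, alpha, P)]
        terms = [w * conditional_product(al, y, xb, psi) for w, al in zip(pi_w, alpha)]
        new_like = sum(terms)
        new_mu = sum(al * t for al, t in zip(alpha, terms)) / new_like
        new_var = sum(al * al * t for al, t in zip(alpha, terms)) / new_like - new_mu ** 2
        converged = like is not None and abs(new_like - like) <= tol * new_like
        like, mu, tau = new_like, new_mu, math.sqrt(new_var)
        if converged:
            break
    return like, mu, tau

y_demo = [1] * 10 + [0] * 10          # hypothetical cluster with 20 level-1 units
xb_demo = [0.0] * 20
print(ordinary_quadrature(y_demo, xb_demo, 1.0), adaptive_quadrature(y_demo, xb_demo, 1.0))
```

For this cluster the posterior is much narrower than the prior, so five ordinary quadrature points give a noticeably different value from five adaptive points, in line with the discussion of sharp peaks above.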

An alternative to computing the posterior moments $\mu_j$ and $\tau_j^2$ is to use
the mode and the curvature at the mode (Liu and Pierce, 1994) as in the
first order Laplace approximation described in Section 6.3.1. In this case,
adaptive quadrature with $R = 1$ quadrature point is equivalent to the first
order Laplace approximation. An advantage of using the mode and curvature
at the mode instead of the posterior moments is that computing the former
does not require numerical integration. However, the approach is not easily
generalized to multilevel models which led Rabe-Hesketh et al. (2004b) to
adopt the Naylor and Smith approach for models belonging to our general
model framework.

Figure 6.2 Prior (dotted curve) and posterior (solid curve) densities and
quadrature weights (bars) for ordinary quadrature and adaptive quadrature. Note that the
integrand is proportional to the posterior density. (Source: Rabe-Hesketh et al., 2002)

Adaptive quadrature has been used by Pinheiro and Bates (1995) for
two-level nonlinear mixed models, Bock and Schilling (1997) for exploratory factor
analysis with dichotomous responses and Rabe-Hesketh et al. (2002, 2004b)
for generalized linear latent and mixed models. Adaptive quadrature, as
implemented in the Stata program gllamm, is used in most applications in this
book involving continuous latent variables.

Monte Carlo experiments have been carried out for two-level models with
dichotomous responses with varying cluster sizes and intraclass correlations to
compare the performance of adaptive and ordinary quadrature (Rabe-Hesketh
et al., 2004b). The performance of adaptive quadrature was excellent, requiring
fewer quadrature points than ordinary quadrature. For combinations of large
cluster sizes and high intraclass correlations ordinary quadrature sometimes
failed, whereas adaptive quadrature worked well with a sufficient number of
points.


6.3.3 Monte Carlo Integration 

Let $\varphi$ be a vector of random variables with distribution $h(\varphi)$. Assume that
we require the expectation of a function $f(\varphi)$ over $\varphi$,

$$E[f(\varphi)] = \int_{-\infty}^{\infty} f(\varphi) h(\varphi) \, d\varphi.$$

Monte Carlo integration approximates the expectation by the mean of $f(\varphi)$
over simulated values of $\varphi$. Different versions of Monte Carlo integration arise
according to how the simulation proceeds.

Crude Monte Carlo integration

In this case independent samples $\varphi^{(r)}$, $r = 1, \ldots, R$, are drawn from $h(\varphi)$
providing the simulator

$$\widehat{E}[f(\varphi)] = \bar{f} \equiv \frac{1}{R} \sum_{r=1}^{R} f(\varphi^{(r)}).$$

By a strong law of large numbers, $\bar{f}$ converges to $E[f(\varphi)]$ with probability 1 as
$R \to \infty$. By a central limit theorem, $\bar{f}$ is approximately normally distributed
when $R$ is large with mean $E[f(\varphi)]$ and variance

$$\mathrm{Var}(\bar{f}) = \frac{1}{R} \mathrm{Var}[f(\varphi)].$$

Letting $\varphi = v_j$, $h(\varphi) = \phi(v_j)$ and $f(\varphi) = \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j)$, we see that
the likelihood for the random intercept model,

$$g^{(2)}(y_{j(2)}) = \int_{-\infty}^{\infty} \phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j) \, dv_j,$$

takes the form of $E[f(\varphi)]$. Monte Carlo integration of the likelihood can then
proceed by sampling from $\phi(v_j)$ and evaluating the mean of $\prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j)$.
Unlike the Laplace approximation, which improves only as the cluster sizes
$n_j$ increase, we can improve the precision simply by increasing the number of
simulations $R$. Furthermore, in contrast to quadrature or the Laplace
approximation, we can assess the accuracy of the approximation by estimating the
variance of the simulator.
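
A crude Monte Carlo simulator for one cluster can be sketched as below. The example again assumes a random intercept logit model; the function names, the seed and the example data are illustrative assumptions. Besides the simulated likelihood contribution, the function returns an estimated standard error, which is the accuracy assessment referred to above.

```python
import math
import random

def conditional_product(v, y, xb, psi):
    """Product over level-1 units of the response probabilities given v (logit link)."""
    prod = 1.0
    for y_ij, xb_ij in zip(y, xb):
        prob = 1.0 / (1.0 + math.exp(-(xb_ij + math.sqrt(psi) * v)))
        prod *= prob if y_ij == 1 else 1.0 - prob
    return prod

def crude_monte_carlo(y, xb, psi, n_draws, seed=1):
    """Return (simulated likelihood contribution, estimated standard error)."""
    rng = random.Random(seed)
    values = [conditional_product(rng.gauss(0.0, 1.0), y, xb, psi)
              for _ in range(n_draws)]
    mean = sum(values) / n_draws
    sample_var = sum((f - mean) ** 2 for f in values) / (n_draws - 1)
    return mean, math.sqrt(sample_var / n_draws)

y_demo = [1, 0, 1, 1]                   # hypothetical responses for one cluster
xb_demo = [0.2, -0.1, 0.0, 0.5]         # hypothetical fixed part of the linear predictor
print(crude_monte_carlo(y_demo, xb_demo, 1.0, n_draws=200))
print(crude_monte_carlo(y_demo, xb_demo, 1.0, n_draws=20000))
```

Increasing the number of draws by a factor of 100 reduces the standard error by roughly a factor of 10.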
Importance sampling

Crude Monte Carlo integration can be improved by using importance sampling
to reduce the sampling variance. A judiciously chosen importance density $g(\varphi)$
is used to simulate $E[f(\varphi)]$ when it is either difficult to sample $\varphi$ from $h(\varphi)$
or $h(\varphi)$ is not smooth. The integral is then written as

$$E[f(\varphi)] = \int_{-\infty}^{\infty} \frac{f(\varphi) h(\varphi)}{g(\varphi)} g(\varphi) \, d\varphi,$$

where $g(\varphi)$ is a density for which (a) it is easy to draw $\varphi$, (b) the support
is the same as for $h(\varphi)$, (c) it is easy to evaluate $f(\varphi)h(\varphi)/g(\varphi)$ given $\varphi$ and (d)
$f(\varphi)h(\varphi)/g(\varphi)$ is bounded and smooth in the parameters over the support of $\varphi$.
The importance sampling simulator is

$$\widehat{E}[f(\varphi)] = \frac{1}{R} \sum_{r=1}^{R} \frac{f(\varphi^{(r)}) h(\varphi^{(r)})}{g(\varphi^{(r)})},$$

where $\varphi^{(r)}$ is a draw from the importance density $g(\varphi)$.

Considering a random intercept model, an importance sampler can be
constructed as in (6.12),

$$g^{(2)}(y_{j(2)}) = \int_{-\infty}^{\infty} \phi(v_j; \mu_j, \tau_j^2) \left[ \frac{\phi(v_j) \prod_i g^{(1)}(y_{ij} \mid \sqrt{\psi}\, v_j)}{\phi(v_j; \mu_j, \tau_j^2)} \right] dv_j,$$

where the importance density $\phi(v_j; \mu_j, \tau_j^2)$ is the normal density
approximating the posterior density. Samples are drawn from $\phi(v_j; \mu_j, \tau_j^2)$ to compute the
mean of the term in brackets. Note that this method is analogous to
adaptive quadrature, which can be viewed as a deterministic version of importance
sampling as pointed out by Pinheiro and Bates (1995). The multivariate
extension of this importance sampler has been used for generalized linear mixed
models by Kuk (1999) and Skaug (2002).
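
The importance sampler for the random intercept model can be sketched as follows. The example assumes a logit link and takes the mean and standard deviation of the normal importance density as given inputs (here the hypothetical values 0 and 0.41, roughly the posterior mean and standard deviation for the example cluster); in practice they would come from a Laplace or adaptive quadrature step. Setting the mean to 0 and the standard deviation to 1 reproduces crude Monte Carlo. Note as a caveat that for a logit model a normal importance density narrower than the prior has lighter tails than the integrand, so condition (d) holds only approximately and the sketch should be read as illustrative.

```python
import math
import random

def normal_pdf(v, mean=0.0, sd=1.0):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def conditional_product(v, y, xb, psi):
    """Product over level-1 units of the response probabilities given v (logit link)."""
    prod = 1.0
    for y_ij, xb_ij in zip(y, xb):
        prob = 1.0 / (1.0 + math.exp(-(xb_ij + math.sqrt(psi) * v)))
        prod *= prob if y_ij == 1 else 1.0 - prob
    return prod

def importance_sampler(y, xb, psi, mu, tau, n_draws, seed=1):
    """Return (simulated likelihood contribution, estimated standard error)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_draws):
        v = rng.gauss(mu, tau)                       # draw from the importance density
        ratios.append(normal_pdf(v) * conditional_product(v, y, xb, psi)
                      / normal_pdf(v, mu, tau))      # term in brackets
    mean = sum(ratios) / n_draws
    sample_var = sum((w - mean) ** 2 for w in ratios) / (n_draws - 1)
    return mean, math.sqrt(sample_var / n_draws)

y_demo = [1] * 10 + [0] * 10            # hypothetical cluster with 20 level-1 units
xb_demo = [0.0] * 20
print(importance_sampler(y_demo, xb_demo, 1.0, mu=0.0, tau=0.41, n_draws=5000))
print(importance_sampler(y_demo, xb_demo, 1.0, mu=0.0, tau=1.0, n_draws=5000))  # crude
```

With the same number of draws, the estimated standard error is much smaller when the importance density is concentrated where the integrand has its mass.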

An interesting alternative to Monte Carlo integration is quasi-Monte Carlo 
where samples are drawn deterministically; see Shaw (1988) and Fang and 
Wang (1994, Chapter 2). 

6.3.4 A tailored simulator: GHK 

Some simulators are tailored for specific models, for instance the Geweke- 
Hajivassiliou-Keane (GHK) (Geweke, 1989; Hajivassiliou and Ruud, 1994; 
Keane, 1994) and Stern simulators (Stern, 1992) for models with multinormal 
latent responses, such as multinomial probit and multivariate probit models. 

For these models the likelihood contributions are probabilities, say $p$, of the
form

$$p \equiv \Pr[(\tau_1^- < \epsilon_1 < \tau_1^+), (\tau_2^- < \epsilon_2 < \tau_2^+), \ldots, (\tau_S^- < \epsilon_S < \tau_S^+)],$$

where $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_S)'$ is an $S$-dimensional multivariate normal vector with
mean zero and covariance matrix $\Sigma$. In the multivariate probit case, $\epsilon$ would
represent residuals of the latent responses $y^*$ and in the multinomial probit
case differences between utility residuals. The integration limits or thresholds
$\tau_s$ would typically depend on covariates. For instance in multivariate probit
models for ordinal or dichotomous responses, $\tau_s = \kappa_s - x'\beta$, where $\kappa_s$ are the
parameters of the threshold model (see Section 2.4).

Note that the probability $p$ is an integral over a rectangular region of the
latent response distribution, not over the latent variable distribution. In
contrast, other methods described in this section use latent variable integration
(see Section 4.9 for a discussion of latent response versus latent variable
integration). Latent response integration is also used in the limited information
method to be discussed in Section 6.7.

Here we briefly describe the popular GHK simulator. First, we exploit the
fact that the probability $p$ can be expressed as a product of sequentially
conditioned univariate normal distributions. Defining

$$\begin{aligned}
Q_1 &= \Pr[(\tau_1^- < \epsilon_1 < \tau_1^+)], \\
Q_2 &= \Pr[(\tau_2^- < \epsilon_2 < \tau_2^+) \mid (\tau_1^- < \epsilon_1 < \tau_1^+)], \\
&\;\;\vdots \\
Q_S &= \Pr[(\tau_S^- < \epsilon_S < \tau_S^+) \mid (\tau_{S-1}^- < \epsilon_{S-1} < \tau_{S-1}^+), \ldots, (\tau_1^- < \epsilon_1 < \tau_1^+)],
\end{aligned}$$

the probability can be written as

$$p = \prod_{s=1}^{S} Q_s.$$

It is easy to calculate $Q_1 = \Phi(\tau_1^+/\sigma_{11}) - \Phi(\tau_1^-/\sigma_{11})$, where $\Phi(\cdot)$ denotes
the univariate cumulative standard normal distribution function and $\sigma_{11}$ is
the standard deviation of $\epsilon_1$. However, each $Q_s$, $s = 2, \ldots, S$, is a conditional
probability that $\epsilon_s$ lies within an interval, given that the other $\epsilon_t$ (which are
correlated with $\epsilon_s$) lie within specific intervals, which is difficult to evaluate.

We therefore orthogonalize the residuals $\epsilon$ using a lower triangular Cholesky
decomposition of the covariance matrix $\Sigma$, $\Sigma = CC'$, with elements $c_{sm}$,
$c_{sm} = 0$ if $m > s$. We can then write $\epsilon = Cu$ where $u$ is an orthogonal vector,
having independent standard normal components $u_s$.

The algorithm then proceeds as follows:

1. For replication $r = 1, \ldots, R$:

(a) For $s = 1$:

• Evaluate $Q_{1r}$:

$$Q_{1r} = \Phi(\tau_1^+/c_{11}) - \Phi(\tau_1^-/c_{11}).$$

• Simulate $u_{1r}$ from a doubly-truncated standard normal distribution,
with truncation points at $\tau_1^-/c_{11}$ and $\tau_1^+/c_{11}$, so that $\epsilon_{1r} = c_{11}u_{1r}$
fulfills the condition $\tau_1^- < \epsilon_{1r} < \tau_1^+$ for the conditional probabilities
$Q_2$ to $Q_S$.

(b) For $s = 2$:

• Evaluate $Q_{2r}$:
Treating $u_{1r}$ as known, $Q_{2r}$ becomes an easily evaluated unconditional
probability

$$Q_{2r} = \Phi([\tau_2^+ - c_{21}u_{1r}]/c_{22}) - \Phi([\tau_2^- - c_{21}u_{1r}]/c_{22}).$$

The first term follows from the equivalence

$$\epsilon_{2r} = c_{21}u_{1r} + c_{22}u_{2r} < \tau_2^+ \iff u_{2r} < [\tau_2^+ - c_{21}u_{1r}]/c_{22},$$

and similarly for the second term.

• Simulate $u_{2r}$ from a doubly-truncated standard normal distribution
truncated at $[\tau_2^- - c_{21}u_{1r}]/c_{22}$ and $[\tau_2^+ - c_{21}u_{1r}]/c_{22}$ so that $\epsilon_{2r}$ satisfies
the conditions in the conditional probabilities $Q_3$ to $Q_S$.

(c) For $s = 3, \ldots, S$:

• Evaluate $Q_{sr}$ sequentially, treating $u_{1r}$ to $u_{s-1,r}$ as known:

$$Q_{sr} = \Phi\left(\left[\tau_s^+ - \sum_{m=1}^{s-1} c_{sm}u_{mr}\right]/c_{ss}\right) - \Phi\left(\left[\tau_s^- - \sum_{m=1}^{s-1} c_{sm}u_{mr}\right]/c_{ss}\right),$$

where the first term follows from

$$\epsilon_{sr} = \sum_{m=1}^{s} c_{sm}u_{mr} < \tau_s^+ \iff u_{sr} < \left[\tau_s^+ - \sum_{m=1}^{s-1} c_{sm}u_{mr}\right]/c_{ss}.$$

• Simulate $u_{sr}$ from a doubly-truncated standard normal distribution
truncated at $[\tau_s^- - \sum_{m=1}^{s-1} c_{sm}u_{mr}]/c_{ss}$ and $[\tau_s^+ - \sum_{m=1}^{s-1} c_{sm}u_{mr}]/c_{ss}$ so
that $\epsilon_{sr}$ satisfies the conditions in the conditional probabilities $Q_{s+1}$
to $Q_S$. (This step is not needed for $s = S$.)

2. After $R$ replications: The required simulated probability is obtained as

$$\hat{p} = \frac{1}{R} \sum_{r=1}^{R} \prod_{s=1}^{S} Q_{sr}.$$

Note that $\hat{p}$ is an unbiased simulator of $p$, with $E(\hat{p}) = p$, where the
expectation is over imagined repeats of the simulation. However, the simulated
log-likelihood is a sum of terms of the form $\log(\hat{p})$ and these will generally
be biased because $E[\log(\hat{p})] \neq \log(p)$, due to the nonlinear log transformation.
The bias can be reduced by increasing the number of replications $R$.
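
The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the truncated normal draws are obtained by the inverse distribution function method, the Cholesky factor is computed by a small helper, and the function names (`cholesky`, `ghk`), the seed and the clamping constant are choices made for the example rather than part of any published implementation.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def cholesky(sigma):
    """Lower triangular C with sigma = C C' (sigma given as nested lists)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for m in range(s + 1):
            partial = sum(c[s][k] * c[m][k] for k in range(m))
            if s == m:
                c[s][m] = math.sqrt(sigma[s][s] - partial)
            else:
                c[s][m] = (sigma[s][m] - partial) / c[m][m]
    return c

def ghk(lower, upper, sigma, n_rep, seed=0):
    """Simulate Pr(lower_s < e_s < upper_s for all s) for e ~ N(0, sigma)."""
    rng = random.Random(seed)
    c = cholesky(sigma)
    dim = len(sigma)
    total = 0.0
    for _ in range(n_rep):
        u = []
        prob = 1.0
        for s in range(dim):
            shift = sum(c[s][m] * u[m] for m in range(s))
            cdf_lo = STD_NORMAL.cdf((lower[s] - shift) / c[s][s])
            cdf_hi = STD_NORMAL.cdf((upper[s] - shift) / c[s][s])
            prob *= cdf_hi - cdf_lo                      # Q_sr
            if s < dim - 1:
                # Inverse-CDF draw from the doubly-truncated standard normal.
                q = cdf_lo + rng.random() * (cdf_hi - cdf_lo)
                q = min(max(q, 1e-12), 1.0 - 1e-12)      # keep inv_cdf in its domain
                u.append(STD_NORMAL.inv_cdf(q))
        total += prob
    return total / n_rep

# Bivariate example with correlation 0.5: probability that both residuals are negative.
print(ghk([-math.inf, -math.inf], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], n_rep=20000))
```

For the bivariate example the exact probability is $1/4 + \arcsin(0.5)/(2\pi) = 1/3$, which gives a simple check on the simulator. When the covariance matrix is diagonal, every replication returns the exact product of univariate probabilities.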

The GHK simulator is typically used in conjunction with gradient methods
(to be discussed in Section 6.4.2) to obtain maximum simulated likelihood
(MSL) estimators. We refer to Train (2003) for a detailed discussion of the
properties of MSL and related estimators.

In econometrics the GHK simulator is very popular for models with
multinormal latent responses, for instance probit panel (longitudinal) models (e.g.
Keane, 1994; Geweke et al., 1994). This is probably because the simulator has
been shown to outperform other simulators although it is relatively easy to
implement. In contrast to the crude Monte Carlo approach originally proposed
by Lerman and Manski (1981), the GHK simulator is a continuous and
differentiable function of the parameters and produces simulated probabilities that
are unbiased and bounded in the (0,1) interval. Furthermore, GHK is more
statistically efficient than other simulators (e.g. Hajivassiliou et al., 1996).
We refer to Stern (1997), Train (2003) and Cappellari and Jenkins (2003) for
relatively nontechnical discussions of the GHK and related simulators.

6.4 Maximizing the likelihood 

There are several methods for maximizing the likelihood, the most common
being the Expectation-Maximization (EM) algorithm and Newton-Raphson
or Fisher scoring algorithms to be described in Sections 6.4.1 and 6.4.2. Each
of the integration methods introduced above may be combined with the
maximization methods to be described.

6.4.1 EM algorithm

The Expectation-Maximization (EM) algorithm is an iterative algorithm for
maximum likelihood estimation in incomplete data problems. The algorithm
was given its name by Dempster et al. (1977) who presented the general
theory for the algorithm and a number of examples. Orchard and Woodbury
(1972) first noted the general applicability of the underlying idea, calling it
the ‘missing information principle’, although applications of the EM algorithm
date back at least to McKendrick (1926).

Perhaps the most prominent application is estimation when there are
missing data on random variables whose realizations would otherwise be observed
(e.g. Little and Rubin, 2002). Another application, which is more important
in the present setting, is in the estimation of latent variable models. In this
case the realizations of latent variables are interpreted as missing data (e.g.
Becker et al., 1997).

The motivating idea behind the EM algorithm is as follows: rather than
performing one complex estimation, the observed data are augmented with latent
data that permit estimation to proceed in a sequence of simple estimation
steps.

The complete data $\mathcal{C} = \{y, X, \zeta\}$ consist of two parts: the incomplete data
$y$ and $X$ that are observable and the unobservable or latent data $\zeta$. The
complete data log-likelihood, imagining that the latent data were observed, is
denoted $\ell_h(\vartheta|\mathcal{C})$. Here we used the $h$ subscript since the log-likelihood is just
the $h$-log-likelihood of Lee and Nelder (1996). In general, the complete data
log-likelihood itself is involved at each iteration of the EM algorithm, which
takes the following form at the $(k+1)$th step:

E-step: Evaluate the posterior expectation

$$Q(\vartheta|\vartheta^k) = E_\zeta[\ell_h(\vartheta|\mathcal{C}) \mid y, X; \vartheta^k],$$

the conditional expectation of the complete data log-likelihood with respect
to the latent variables, given the incomplete data and the estimates $\vartheta^k$ from
the previous iteration, i.e. an expectation over the posterior density of $\zeta$.

M-step: Maximize $Q(\vartheta|\vartheta^k)$ with respect to $\vartheta$ to produce an updated estimate
$\vartheta^{k+1}$.

This can sometimes be accomplished analytically, but usually requires
iterative algorithms such as gradient methods (see Section 6.4.2).

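We now consider implementation of the EM algorithm for two-level latent
variable models with multinormal latent variables. As in Section 6.3.2 it is
convenient to change the variables of integration to independent standard
normal latent variables $v_j = (v_{j1}, v_{j2}, \ldots, v_{jM})'$ using the Cholesky
decomposition $\zeta_j = Q v_j$ (where $Q$ depends on $\vartheta$). The complete data log-likelihood,
treating the orthogonalized latent variables $v_j$ as observed, can then be
expressed as

$$
\begin{aligned}
\ell_h(\vartheta|\mathcal{C}) &= \ln \Big\{ \prod_j \Big[ \prod_i g(y_{ij}|Q v_j; \vartheta) \prod_m \phi(v_{jm}) \Big] \Big\} \\
&= \sum_j \Big\{ \sum_i \ln g(y_{ij}|Q v_j; \vartheta) + \sum_m \ln \phi(v_{jm}) \Big\} \\
&= \sum_j \ell_h^j(\vartheta|\mathcal{C}),
\end{aligned}
\tag{6.13}
$$

where $\ell_h^j(\vartheta|\mathcal{C})$ is a cluster-contribution to the complete data log-likelihood.

E-step: Evaluate

$$
\begin{aligned}
Q(\vartheta|\vartheta^k) &= E_\zeta[\ell_h(\vartheta|\mathcal{C}) \mid y; \vartheta^k] \\
&= \sum_j \int \ell_h^j(\vartheta|\mathcal{C})\, \omega(v_j|y_{j(2)}; \vartheta^k)\, dv_j,
\end{aligned}
$$

where $\omega(v_j|y_{j(2)}; \vartheta^k)$ is the posterior density of the latent variables $v_j$ for
cluster $j$ given the observed responses $y_{j(2)}$ for that cluster. Using Bayes
theorem, the posterior becomes

$$
\omega(v_j|y_{j(2)}; \vartheta^k) =
\frac{\prod_i g(y_{ij}|Q^k v_j; \vartheta^k) \prod_m \phi(v_{jm})}
     {\int \prod_i g(y_{ij}|Q^k v_j; \vartheta^k) \prod_m \phi(v_{jm})\, dv_j},
\tag{6.14}
$$

where the $k$ superscript in $Q^k$ denotes that this matrix depends on $\vartheta^k$.
Using (6.13) and (6.14), $Q(\vartheta|\vartheta^k)$ simplifies to

$$
Q(\vartheta|\vartheta^k) = \sum_j A_j^{-1} \int \Big\{ \sum_i \ln g(y_{ij}|Q v_j; \vartheta) + \sum_m \ln \phi(v_{jm}) \Big\}
\times \prod_i g(y_{ij}|Q^k v_j; \vartheta^k) \prod_m \phi(v_{jm})\, dv_j,
$$

where

$$A_j = \int \prod_i g(y_{ij}|Q^k v_j; \vartheta^k) \prod_m \phi(v_{jm})\, dv_j.$$

Note that $A_j$ does not depend on the unknown parameters $\vartheta$ but only on
the values $\vartheta^k$ obtained in the previous iteration.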

The E-step is complicated since the integral cannot in general be solved
analytically, and several approximate methods based on numerical
integration or simulation have been suggested. Monte Carlo integration has been
suggested (e.g. Wei and Tanner, 1990) to yield Monte Carlo Expectation
Maximization (MCEM) algorithms. In the present context, we consider
vector draws $d_{jr} = (d_{jr1}, d_{jr2}, \ldots, d_{jrM})'$ for the independent normally
distributed random variables in $v_j$ with replications $r = 1, 2, \ldots, R$. This
provides the Monte Carlo integration approximation

$$
Q^{MC}(\vartheta|\vartheta^k) = \sum_j \sum_r c_{jr}^{MC} \Big\{ \sum_i \ln g(y_{ij}|Q d_{jr}; \vartheta) + \sum_m \ln \phi(d_{jrm}) \Big\},
$$

with weights that do not depend on the unknown parameters $\vartheta$,

$$
c_{jr}^{MC} = \frac{\prod_i g(y_{ij}|Q^k d_{jr}; \vartheta^k)}{\sum_r \prod_i g(y_{ij}|Q^k d_{jr}; \vartheta^k)},
$$

satisfying $\sum_r c_{jr}^{MC} = 1$. As an alternative to using crude Monte Carlo
integration, Meng and Schilling (1996) suggest using Gibbs sampling (see
Section 6.11.5). A problematic feature of MCEM is the inability to
quantify the Monte Carlo error introduced at each step of the algorithm (e.g.
McCulloch, 1997; Hobert, 2000). If the number of replications $R$ is too
small the E-step will be swamped by Monte Carlo error, whereas an
unnecessarily large $R$ is wasteful. In fact, Booth et al. (2001) point out that
MCEM does not converge in the usual sense unless $R$ increases with $k$.
Since the latent variable distribution is specified as normal, Gauss-Hermite
quadrature can alternatively be used yielding

$$
Q^{GH}(\vartheta|\vartheta^k) = \sum_j \sum_{r_1} \cdots \sum_{r_M} c_{jr}^{GH} \Big\{ \sum_i \ln g(y_{ij}|Q a_r; \vartheta) + \sum_m \ln \phi(a_{r_m}) \Big\},
$$

with weights that do not depend on $\vartheta$,

$$
c_{jr}^{GH} = \frac{\prod_m p_{r_m} \prod_i g(y_{ij}|Q^k a_r; \vartheta^k)}{\sum_r \prod_m p_{r_m} \prod_i g(y_{ij}|Q^k a_r; \vartheta^k)},
$$

satisfying $\sum_r c_{jr}^{GH} = 1$. Here, $a_r = (a_{r_1}, a_{r_2}, \ldots, a_{r_M})'$ and $p_{r_m}$
denote quadrature locations and weights, respectively. Bock and Schilling
(1997) suggest improving the quadrature approximation by using adaptive
quadrature.

M-step: Maximize $Q(\vartheta|\vartheta^k)$ with respect to $\vartheta$.

For MCEM, this is equivalent to solving the equation

$$
\frac{\partial Q^{MC}(\vartheta|\vartheta^k)}{\partial \vartheta} = \sum_j \sum_r c_{jr}^{MC} \frac{\partial}{\partial \vartheta} \sum_i \ln g(y_{ij}|Q d_{jr}; \vartheta) = 0.
$$

This is a weighted score function for $\vartheta$ in a generalized linear model for
expanded data $y_{ij}$, $d_{jr}$ with known weights $c_{jr}^{MC}$. $Q^{MC}(\vartheta|\vartheta^k)$ can thus simply
be maximized by weighted maximum likelihood, using standard software.
For Gauss-Hermite quadrature the M-step amounts to solving

$$
\frac{\partial Q^{GH}(\vartheta|\vartheta^k)}{\partial \vartheta} = \sum_j \sum_r c_{jr}^{GH} \frac{\partial}{\partial \vartheta} \sum_i \ln g(y_{ij}|Q a_r; \vartheta) = 0,
$$

which can also be maximized with an appropriately weighted maximum
likelihood algorithm. See Aitkin (1999a) for a suggested implementation
of this algorithm for two-level models and Vermunt (2004) for higher-level
models.
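
The Gauss-Hermite weights of the E-step can be sketched in a few lines for the simplest case. The sketch below assumes a random-intercept logit model with a single latent variable (M = 1), an intercept `beta0` and a random-intercept variance `psi`, so that the Cholesky factor is simply the square root of `psi`. The function name and arguments are hypothetical and serve only as an illustration.

```python
import numpy as np

def gh_posterior_weights(y, beta0, psi, n_points=15):
    """E-step weights c_jr for a random-intercept logit model (illustrative).

    y     : (J, n) array of 0/1 responses, J clusters of constant size n
    beta0 : current intercept estimate (part of theta^k)
    psi   : current random-intercept variance, so that Q^k = sqrt(psi)
    """
    # Nodes and weights for the weight function exp(-a^2 / 2); dividing the
    # weights by sqrt(2 * pi) turns them into standard normal masses p_r.
    a, w = np.polynomial.hermite_e.hermegauss(n_points)
    p = w / np.sqrt(2.0 * np.pi)

    # Linear predictor at each quadrature location: beta0 + Q^k * a_r
    eta = beta0 + np.sqrt(psi) * a
    log_p1 = -np.log1p(np.exp(-eta))  # ln Pr(y = 1 | a_r)
    log_p0 = -np.log1p(np.exp(eta))   # ln Pr(y = 0 | a_r)

    # ln prod_i g(y_ij | Q^k a_r; theta^k), shape (J, R)
    n = y.shape[1]
    s = y.sum(axis=1)[:, None]
    log_g = s * log_p1[None, :] + (n - s) * log_p0[None, :]

    # c_jr is proportional to p_r * prod_i g(...), normalized over r
    log_c = np.log(p)[None, :] + log_g
    log_c -= log_c.max(axis=1, keepdims=True)  # guard against underflow
    c = np.exp(log_c)
    return a, c / c.sum(axis=1, keepdims=True)

# Three clusters of size four
y = np.array([[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]])
locations, weights = gh_posterior_weights(y, beta0=0.0, psi=1.0)
posterior_means = weights @ locations  # posterior mean of v_j per cluster
```

The rows of `weights` sum to one, as required of the $c_{jr}^{GH}$, and they would serve as the known weights for the expanded data in the weighted maximum likelihood M-step.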


Usually, the E-step is the demanding step but in some situations the M-step
is more difficult. One simplification may be the use of the ECM algorithm
(Meng and Rubin, 1993) which replaces each M-step with a sequence of
conditional maximization steps with subsets of $\vartheta$ being fixed at their previous
values. Other modifications are discussed in Little and Rubin (2002). Another
possibility is to use simulation in the M-step, proceeding by crude Monte Carlo
integration or by means of more elaborate approaches such as the Metropolis
algorithm (see Section 6.11.5) used by McCulloch (1994, 1997) or importance
sampling as suggested by Booth and Hobert (1999).

It should be noted that EM works best if the complete data distribution is of 
regular exponential family form. In this case it can be shown (e.g. Tanner, 1996) 
that the E-step consists of estimating the complete data sufficient statistics by 
their posterior expectations. Given these estimates, the likelihood equations 
for the M-step then take the same form as for complete data, so that standard 
software can be used. 

It is instructive to consider estimation of a conventional exploratory factor
model, where implementation of the EM algorithm is extremely simple.
Expected sufficient statistics, expressed in closed form, are obtained in the
E-step whereas the M-step requires only elementary linear algebra.

Example: EM exploratory factor analysis 

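Consider the exploratory factor model introduced in Section 3.3.3,

$$y_j = \Lambda \eta_j + \epsilon_j,$$

where $y_j$ are mean-centered responses, $\eta_j \sim N_m(0, I)$, $\epsilon_j \sim N_n(0, \Theta)$, $\eta_j$
and $\epsilon_j$ independent, $\Theta$ diagonal and $\Lambda$ unstructured.

It follows from the exploratory factor model that

$$
\begin{bmatrix} y_j \\ \eta_j \end{bmatrix} \sim N_{n+m}\left( 0, \begin{bmatrix} \Lambda\Lambda' + \Theta & \Lambda \\ \Lambda' & I \end{bmatrix} \right).
$$

We also obtain

$$\eta_j \mid y_j; \vartheta^k \sim N_m(F^k y_j, \Gamma^k).$$

Here,

$$F^k = \Lambda^{k\prime} (\Lambda^k \Lambda^{k\prime} + \Theta^k)^{-1}$$

and

$$\Gamma^k = I - F^k \Lambda^k.$$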

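M-step: express the standard multivariate regression estimators in terms
of the expected sufficient statistics from the E-step (instead of the usual
sufficient statistics):

$$
\Lambda^{k+1} = E_\eta(S_{y\eta}|y, \vartheta^k) \{E_\eta(S_{\eta\eta}|y, \vartheta^k)\}^{-1}
= S_{yy} F^{k\prime} (F^k S_{yy} F^{k\prime} + \Gamma^k)^{-1},
$$

$$
\begin{aligned}
\Theta^{k+1} &= \mathrm{diag}\big( E_\eta(S_{yy}|y, \vartheta^k) - E_\eta(S_{y\eta}|y, \vartheta^k) [E_\eta(S_{\eta\eta}|y, \vartheta^k)]^{-1} E_\eta(S_{y\eta}|y, \vartheta^k)' \big) \\
&= \mathrm{diag}\big( S_{yy} - S_{yy} F^{k\prime} (F^k S_{yy} F^{k\prime} + \Gamma^k)^{-1} F^k S_{yy} \big) \\
&= \mathrm{diag}\big( S_{yy} - \Lambda^{k+1} F^k S_{yy} \big).
\end{aligned}
$$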
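
The E-step and M-step of this example translate almost line by line into code. The following is a minimal sketch, assuming that `S_yy` is the sample covariance matrix of the mean-centered responses; the function name, the starting values and the fixed number of iterations are our own illustrative choices.

```python
import numpy as np

def em_factor_analysis(S_yy, m, n_iter=200):
    """EM for y_j = Lambda eta_j + eps_j, eta ~ N(0, I), eps ~ N(0, Theta).

    S_yy : (n, n) sample covariance matrix of mean-centered responses
    m    : number of factors
    Returns Lambda, the diagonal of Theta, and the log-likelihood per unit
    evaluated at the start of each iteration.
    """
    n = S_yy.shape[0]
    rng = np.random.default_rng(0)
    Lam = 0.1 * rng.standard_normal((n, m))  # arbitrary starting values
    theta = np.diag(S_yy).copy()
    trace = []
    for _ in range(n_iter):
        Sigma = Lam @ Lam.T + np.diag(theta)
        _, logdet = np.linalg.slogdet(Sigma)
        trace.append(-0.5 * (n * np.log(2.0 * np.pi) + logdet
                             + np.trace(np.linalg.solve(Sigma, S_yy))))
        # E-step: F^k = Lambda' Sigma^{-1} and Gamma^k = I - F^k Lambda
        F = np.linalg.solve(Sigma, Lam).T
        Gam = np.eye(m) - F @ Lam
        S_y_eta = S_yy @ F.T                # expected S_{y eta}
        S_eta_eta = F @ S_yy @ F.T + Gam    # expected S_{eta eta}
        # M-step: regression estimators based on expected sufficient statistics
        Lam = S_y_eta @ np.linalg.inv(S_eta_eta)
        theta = np.diag(S_yy - Lam @ F @ S_yy).copy()
    return Lam, theta, np.array(trace)

# One-factor covariance structure used as input
lam0 = np.array([[0.9], [0.8], [0.7], [0.6], [0.5]])
S = lam0 @ lam0.T + np.diag(1.0 - lam0[:, 0] ** 2)
Lam_hat, theta_hat, loglik_trace = em_factor_analysis(S, m=1, n_iter=500)
```

The returned log-likelihood trace should be nondecreasing, in line with the monotonicity property of EM, and the loadings are determined only up to sign (and rotation when `m` exceeds one).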

The idea of using the EM algorithm for estimation of factor models was due 
to Dempster et al. (1977). Elaboration and extension to confirmatory factor 
models (see Section 3.3.3) was presented by Rubin and Thayer (1982) and 
Schoenberg and Richtand (1984). Liu et al. (1998) show that the algorithm 
presented above is a special case of the so-called parameter extended EM (PX- 
EM) algorithm. Chen (1981) describes estimation of the conventional MIMIC 
model (see Section 3.5) using the EM algorithm. 

Example continued: EM exploratory factor analysis 

The EM algorithm can alternatively be implemented in a slightly different
way, which may have more intuitive appeal. Noting that the above
equations for $\Lambda^{k+1}$ and $\Theta^{k+1}$ correspond to linear regression of $y_j$ on $\tilde\eta_j^k = F^k y_j$,
we may proceed with the following iterative algorithm:

E-step: Impute the missing $\eta_j$ by $\tilde\eta_j^k$. This is an example of the ‘regression
method’ for factor scoring to be described in Section 7.3.1.

M-step: Estimate $\lambda_i^{k+1}$ (row $i$ of $\Lambda$) and $\theta_{ii}^{k+1}$ by OLS regression of $y_{ij}$
(the $i$th component of $y_j$) on $\tilde\eta_j^k$. This simplification arises since the
model has a diagonal residual covariance matrix $\Theta$.

This formulation illustrates that the EM algorithm can be viewed as a
formalization of an intuitive approach to handling missing data: (1) impute missing
values by predicted values, (2) estimate parameters treating imputed values
as data, (3) re-impute the missing values treating the new estimates as true
parameters, (4) re-estimate the parameters, and so on until convergence. It is
important to keep in mind that this imputation based approach only works if
the complete data likelihood equations are linear in the missing data.
Otherwise, the approach may yield severely biased estimates.

Moving outside the exponential family, the numerical inaccuracy of the E-step
is liable to produce artificial modes for the function to be maximized in
the M-step (e.g. Meng and Rubin, 1992). This led Meng and Schilling (1996)
to use data augmentation, generating latent responses $y^*$ to modify the E-step
for the probit factor model of Bock and Aitkin (1981). Although the
model for $y$ is not a member of the exponential family, Meng and Schilling
exploit the fact that the complete data distribution of the latent responses $y^*$
is in exponential family form. See also Section 6.11.5 for an example of data
augmentation in probit models.

The EM algorithm has been used for a wide range of latent variable models.
For instance, factor and MIMIC models for continuous responses have been
estimated by Dempster et al. (1981) and Chen (1981), respectively, exploratory
probit factor models by Bock and Aitkin (1981) and the one-parameter
logistic IRT model by Thissen (1982). Linear mixed models were considered by
Strenio et al. (1983) and Raudenbush and Bryk (2002) and generalized linear
mixed models by Aitkin et al. (1981), Aitkin (1999a) and Vermunt (2004).
The EM algorithm is the most popular method for estimating latent class and
finite mixture models (e.g. Goodman, 1974).

An often mentioned advantage of the EM algorithm is ease of implementation
as compared to other optimization methods. Although this is certainly
true in many settings, it should be evident from the formulae derived above
that this argument does not appear to have much force in the context of
complex latent variable models. Theoretical advantages include the fact that each
iteration increases the likelihood and that if the sequence $\vartheta^k$ converges, it
converges to a local maximum or saddle point.

An important disadvantage of the EM algorithm is that convergence can be 
very slow whenever there is a large fraction of missing information. Another 
disadvantage is that an estimated information matrix is not a direct byproduct 
of maximization, in contrast to the case for gradient methods such as Newton- 
Raphson. One possible approach is to augment EM with a final Newton- 
Raphson step after convergence. Procedures for obtaining the information 
matrix within the EM algorithm have been suggested by Louis (1982), Meng
and Rubin (1991) and Oakes (1999) among others.


6.4.2 Gradient methods

In order to maximize the log-likelihood, we must solve the likelihood equations

$$\frac{\partial \ell(\vartheta)}{\partial \vartheta} = 0.$$

Gradient methods are iterative where we let the parameters in the $k$th
iteration be denoted $\vartheta^k$.

Newton-Raphson and Fisher Scoring 

The Newton-Raphson algorithm can be derived by considering an
approximation of the derivatives of the log-likelihood using a first order Taylor series
expansion around the current parameter estimates $\vartheta^k$:

$$
\frac{\partial \ell(\vartheta)}{\partial \vartheta} \approx g(\vartheta^k) + H(\vartheta^k)(\vartheta - \vartheta^k), \tag{6.15}
$$

where $g(\vartheta^k)$ is the $v$-dimensional gradient vector and $H(\vartheta^k)$ is the Hessian,
the $v \times v$ matrix of second derivatives of the log-likelihood with respect to the
parameters, evaluated at $\vartheta^k$. The updated parameters $\vartheta^{k+1}$ are the
parameters for which this first order Taylor expansion is zero, i.e.,

$$g(\vartheta^k) + H(\vartheta^k)(\vartheta^{k+1} - \vartheta^k) = 0$$

so that

$$\vartheta^{k+1} = \vartheta^k - H(\vartheta^k)^{-1} g(\vartheta^k).$$

Note that the Taylor expansion is exact if the log-likelihood is quadratic in
the parameters, in which case the maximum is found in a single iteration. The
canonical example is the standard linear regression model.

The Fisher scoring algorithm is similar to the Newton-Raphson algorithm
but the negative of Fisher’s information matrix $I(\vartheta^k)$ is used instead of the
Hessian, i.e.

$$\vartheta^{k+1} = \vartheta^k + I(\vartheta^k)^{-1} g(\vartheta^k),$$

where

$$I(\vartheta^k) = -E(H(\vartheta^k)).$$

An advantage of the Newton-Raphson and Fisher scoring algorithms compared
with EM is that they provide estimates of the standard errors for the maximum
likelihood estimates $\hat\vartheta$. In the case of Fisher scoring, the inverse information
is used, whereas the inverse of $-H(\hat\vartheta)$, the ‘observed information’, is used in
the Newton-Raphson algorithm (see also Section 8.3).
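
As an illustration of the Newton-Raphson update and of standard errors based on the observed information, consider the following minimal sketch. It assumes an ordinary logistic regression without latent variables, chosen only because the gradient and Hessian are available in closed form (for this canonical-link model the Hessian does not depend on the responses, so Newton-Raphson and Fisher scoring coincide). The function name and the simulated data are hypothetical.

```python
import numpy as np

def newton_raphson_logit(X, y, tol=1e-10, max_iter=50):
    """Maximize a logistic regression log-likelihood by Newton-Raphson."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        g = X.T @ (y - p)                            # gradient g(theta^k)
        H = -(X * (p * (1.0 - p))[:, None]).T @ X    # Hessian H(theta^k)
        step = np.linalg.solve(H, g)
        theta = theta - step                         # theta^k - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the inverse of the observed information -H
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    H = -(X * (p * (1.0 - p))[:, None]).T @ X
    se = np.sqrt(np.diag(np.linalg.inv(-H)))
    return theta, se

# Simulated data: intercept 0.5 and slope 1.0 are arbitrary choices
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
X = np.column_stack([np.ones(200), x])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))).astype(float)
theta_hat, se = newton_raphson_logit(X, y)
```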

Quasi-Newton methods 

Newton-Raphson and Fisher scoring require second order derivatives of the 
log-likelihood with respect to the parameters. Computing these analytically 
is often difficult and computing them numerically can be very slow. Thus, 
quasi-Newton algorithms only requiring gradients have been proposed. 

A useful algorithm was described by Berndt, Hall, Hall and Hausman (1974).
Their BHHH or BH$^3$ algorithm is based on the fact that, under correct model
specification, the information matrix equals the covariance matrix of the
gradients,

$$I(\vartheta^k) = -E(H(\vartheta^k)) = E(g(\vartheta^k) g(\vartheta^k)').$$

From a law of large numbers it follows that a consistent estimator of this
covariance matrix is given by

$$I^{BH3}(\vartheta^k) \equiv \sum_z g_z(\vartheta^k) g_z(\vartheta^k)',$$

where $g_z(\vartheta^k)$ are score vectors, the top-level cluster contributions to the
gradients,

$$g(\vartheta^k) = \sum_z g_z(\vartheta^k).$$

The BH$^3$ algorithm uses this estimator in Fisher scoring, giving

$$\vartheta^{k+1} = \vartheta^k + [I^{BH3}(\vartheta^k)]^{-1} g(\vartheta^k).$$

A merit of the BH$^3$ algorithm is that only gradients are required; neither
Hessians nor Fisher information matrices must be computed. Little and Rubin
(2002) claim that the performance of the BH$^3$ algorithm can be erratic because
the accuracy of the approximation to the information matrix depends on the
validity of the model, but our experience suggests that the algorithm works
well even with ‘bad’ starting values.
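
The BH$^3$ update can be sketched for the same kind of toy problem. The sketch below again assumes an ordinary logistic regression, where each observation plays the role of a top-level cluster, and it adds a simple step-halving safeguard that is our own addition rather than part of the algorithm as described above. All names and the simulated data are hypothetical.

```python
import numpy as np

def loglik(theta, X, y):
    eta = X @ theta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def bhhh_logit(X, y, tol=1e-8, max_iter=500):
    """Maximize a logistic log-likelihood using only score vectors."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        scores = (y - p)[:, None] * X     # row z holds g_z(theta^k)'
        g = scores.sum(axis=0)            # g(theta^k) = sum_z g_z(theta^k)
        info = scores.T @ scores          # sum_z g_z g_z'
        if np.max(np.abs(g)) < tol:
            break
        direction = np.linalg.solve(info, g)
        # Step halving (an added safeguard): shrink the step if the
        # log-likelihood would decrease noticeably.
        step, ll_old = 1.0, loglik(theta, X, y)
        while (loglik(theta + step * direction, X, y) < ll_old - 1e-9
               and step > 1e-10):
            step *= 0.5
        theta = theta + step * direction
    return theta, info

# Simulated data: intercept 0.5 and slope 1.0 are arbitrary choices
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
X = np.column_stack([np.ones(200), x])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x)))).astype(float)
theta_bhhh, info_bhhh = bhhh_logit(X, y)
```

Because the outer-product matrix is positive definite whenever the score vectors span the parameter space, the search direction is always an ascent direction, which is why step halving suffices as a safeguard here.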

Other examples of quasi-Newton algorithms include Davidon-Fletcher-Powell
(DFP) and Broyden-Fletcher-Goldfarb-Shanno (BFGS) which involve
different approximations of the Hessian $H(\vartheta^k)$. Although the approximations work
well for optimization, caution should be exercised in basing estimated
covariance matrices for the estimated parameters on these approximations (e.g.
Thisted, 1987).

The Newton-Raphson algorithm has been used for latent class models by 
Haberman (1989), for generalized linear mixed models by Pan and Thompson 
(2003) and for generalized linear latent and mixed models by Rabe-Hesketh et 
al. (2002, 2004a). Pan and Thompson used analytical first and second order 
derivatives of the log-likelihood whereas Rabe-Hesketh et al. used numerical 
derivatives. McCulloch (1997) and Kuk and Cheng (1997) discuss Monte-Carlo 
Newton Raphson algorithms for generalized linear mixed models. 

Fisher scoring was used by Longford (1987) for linear mixed models. The
BH$^3$ algorithm was used in latent variable modeling by Arminger and Küsters
(1989), Skrondal (1996) and Hedeker and Gibbons (1994, 1996a), among
others. Davidon-Fletcher-Powell (DFP) was introduced for factor models by
Jöreskog (1967). Skaug (2002) used a quasi-Newton method with line search
for generalized linear mixed models. Here first order derivatives were obtained
by automatic differentiation, that is, the code for evaluating the derivatives
was generated automatically by a computer program from the code to evaluate
the log-likelihood.

6.5 Nonparametric maximum likelihood estimation 

Estimation of models with discrete latent variables is straightforward using
EM or gradient methods since the likelihood is a finite mixture so that no
integration is involved. The discrete distribution is characterized by a finite set
of locations $e_c$, $c = 1, \ldots, C$ and probabilities or masses $\pi_c$ (at these locations).
If the number of masses $C$ of the discrete distribution is chosen to maximize the
likelihood, the nonparametric maximum likelihood estimator can be achieved
(e.g. Simar, 1976; Laird, 1978; Lindsay, 1983); see Section 4.4.2. Attempting
to add an additional mass-point would then either result in one estimated
probability approaching zero or two estimated locations nearly coinciding.

In this section we briefly describe methods for finding the number of masses
of the nonparametric maximum likelihood estimator (NPMLE). A common
approach is to start estimation with a large number of mass-points and omit
points that either merge with other points or whose mass approaches zero
during maximization of the likelihood (e.g. Butler and Louis, 1992). Another
approach is to introduce mass-points one by one using the concept of a
directional derivative (e.g. Simar, 1976; Jewell, 1982; Böhning, 1982; Lindsay,
1983; Rabe-Hesketh et al., 2003a), referred to as the Gâteaux derivative by
Heckman and Singer (1984).

Consider a model with a single latent variable with maximized log-likelihood
$\ell(\hat\vartheta^C, \hat\pi^C, \hat e^C)$ for $C$ masses. To determine whether this is the NPMLE
solution, we consider changing the discrete mass-point distribution along the path
$([1-\lambda]\hat\pi^{C\prime}, \lambda)'$ with locations $(\hat e^{C\prime}, e_{C+1})'$, where $\lambda = 0$ corresponds to the
current solution and $\lambda = 1$ places unit mass at a new location $e_{C+1}$. The
directional derivative is then defined as

$$
\Delta(e_{C+1}) = \lim_{\lambda \to 0}
\frac{\ell(\hat\vartheta^C, ([1-\lambda]\hat\pi^{C\prime}, \lambda)', (\hat e^{C\prime}, e_{C+1})') - \ell(\hat\vartheta^C, \hat\pi^C, \hat e^C)}{\lambda}.
\tag{6.16}
$$

According to the general mixture maximum likelihood theorem (Lindsay, 1983;
Böhning, 1982), the NPMLE has been found if and only if $\Delta(e_{C+1}) \le 0$ for
all $e_{C+1}$.


Rabe-Hesketh et al. (2003a) suggested searching for a new location $e_{C+1}$
over a fine grid spanning a wide range of values and terminating the algorithm
if, for a small value of $\lambda$, the numerator of (6.16) is negative for all locations.
This approach is similar to the algorithm proposed by Simar (1976) and adapted
by Heckman and Singer (1984) and Follmann and Lambert (1989), among others.
Algorithms for finding the NPMLE (both the number of masses and parameter
estimates) are described in Lindsay (1995) and Böhning (2000).
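
The grid search over new locations can be sketched as follows. The sketch assumes the simplest possible setting, a mixture of unit-variance normal densities in which the latent variable is the mean, with no other parameters. In that case differentiating the mixture log-likelihood along the path in (6.16) gives $\Delta(e) = \sum_j f(y_j|e)/f(y_j; \hat\pi, \hat e) - J$. The function names, the grid limits and the toy data are hypothetical.

```python
import math

def normal_pdf(y, mean):
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def mixture_density(y, locations, masses):
    return sum(p * normal_pdf(y, e) for e, p in zip(locations, masses))

def directional_derivative(e_new, data, locations, masses):
    """Delta(e_new) = sum_j f(y_j | e_new) / f(y_j; current mixture) - J."""
    ratios = (normal_pdf(y, e_new) / mixture_density(y, locations, masses)
              for y in data)
    return sum(ratios) - len(data)

def search_new_location(data, locations, masses, lo=-5.0, hi=5.0, n_grid=201):
    """Return the grid location with the largest directional derivative."""
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    deltas = [directional_derivative(e, data, locations, masses) for e in grid]
    best = max(range(n_grid), key=lambda i: deltas[i])
    return grid[best], deltas[best]

# A single mass at zero is clearly inadequate for these bimodal data,
# so the largest directional derivative is positive.
data = [-2.0, -2.0, 2.0, 2.0]
e_best, delta_best = search_new_location(data, locations=[0.0], masses=[1.0])
```

A positive maximum indicates that adding a mass at the corresponding location increases the likelihood; once the maximum over the grid is no longer positive, the search would stop.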

The important merit of NPMLE is that we need not assume a parametric
distribution for the latent variables, potentially making inferences more
robust. However, for generalized linear models with covariate measurement
error, simulations indicate that inference based on multivariate normal latent
variables is fairly robust to misspecification (e.g. Thoresen and Laake, 2000).
Although Schafer (2001) and Rabe-Hesketh et al. (2003a) found that estimates
assuming the conventional (misspecified) model were biased, these estimates
had a smaller root mean squared error than the unbiased NPML estimates.
Little is known about the performance of NPMLE for models with a large
number of latent variables. For models with categorical responses, boundary
solutions where locations approach $\pm\infty$ can pose problems.

Nonparametric maximum likelihood estimation has been used for survival
or duration models (e.g. Heckman and Singer, 1984; Holmås, 2002), item
response models (e.g. de Leeuw and Verhelst, 1986; Lindsay et al., 1991),
generalized linear models with covariate measurement error (e.g. Roeder et
al., 1996; Aitkin and Rocci, 2002; Rabe-Hesketh et al., 2003a), random
coefficient models (e.g. Davies and Pickles, 1987; Aitkin, 1999a) and meta-analysis
(e.g. Aitkin, 1999b). We use NPMLE in many applications in this book, for
instance in Sections 9.5, 11.2, 11.3.3 and 14.2.


6.6 Restricted/Residual maximum likelihood (REML) 

It is instructive to initially consider the simple linear regression model

$$y_i = x_i'\beta + \epsilon_i, \quad \epsilon_i \sim N(0, \theta), \quad i = 1, \ldots, N,$$

where $\beta$ contains $P$ regression parameters. The maximum likelihood estimator
of the residual variance is

$$\hat\theta = \frac{1}{N} \sum_{i=1}^{N} (y_i - x_i'\hat\beta)^2.$$

It is well known that this estimator is downward biased with ‘bias factor’
$(N-P)/N$, i.e., $E(\hat\theta) = \frac{N-P}{N}\theta$. $\hat\theta$ would be unbiased if the regression parameters
$\beta$ were known, but is biased when based on $\hat\beta$ since $x_i'\hat\beta$ ‘fits the data more
closely’ than $x_i'\beta$. Thus, the bias-corrected estimator,

$$\tilde\theta = \frac{1}{N-P} \sum_{i=1}^{N} (y_i - x_i'\hat\beta)^2,$$

is typically used instead.
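
The bias factor can be checked numerically. The residual sum of squares has expectation $\theta\,\mathrm{tr}(I - X(X'X)^{-1}X') = \theta(N-P)$, so dividing by $N$ rather than $N-P$ gives the factor $(N-P)/N$. The following sketch computes both estimators and the bias factor; the function names and the toy data are hypothetical.

```python
import numpy as np

def residual_variance_estimates(X, y):
    """Return the ML and the bias-corrected residual variance estimates."""
    N, P = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta_hat) ** 2))
    return rss / N, rss / (N - P)

def bias_factor(X):
    """E(RSS) / (N * theta) = trace(I - X (X'X)^{-1} X') / N = (N - P) / N."""
    N = X.shape[0]
    hat = X @ np.linalg.solve(X.T @ X, X.T)
    return np.trace(np.eye(N) - hat) / N

# N = 6 observations and P = 2 regression parameters
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8])
theta_ml, theta_corrected = residual_variance_estimates(X, y)
factor = bias_factor(X)  # (6 - 2) / 6
```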

The same bias issue applies for latent variable models where estimates of
variance parameters are expected to be biased downwards. For instance, for
a two-level random intercept model Raudenbush and Bryk (2002) point out
that the maximum likelihood estimator of the random intercept variance $\psi$ is
biased with approximate bias factor $(J-P)/J$, where $J$ is the number of clusters.

To address this problem Patterson and Thompson (1971) suggested the
so-called restricted or residual maximum likelihood (REML) method. Maximum
likelihood is in this case not applied directly to the responses $y$ but instead to
linear functions or ‘error contrasts’ of the responses, say $Ay$. Importantly, $A$
is specified as orthogonal to $X$ so $Ay$ ‘sweeps out’ the fixed effects from the
model. It follows that REML itself does not produce estimates of the fixed
effects $\beta$.
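
One way to construct error contrasts, sketched below under the assumption that $X$ has full column rank, is to take an orthonormal basis of the orthogonal complement of the column space of $X$ from a complete QR decomposition. This is only one of many valid choices of $A$; the function name and the toy data are hypothetical.

```python
import numpy as np

def error_contrasts(X):
    """Return an (N - P) x N matrix A with A X = 0 (X of full column rank)."""
    N, P = X.shape
    # In a complete QR decomposition the first P columns of Q span the
    # column space of X, so the remaining N - P columns are orthogonal to it.
    Q, _ = np.linalg.qr(X, mode="complete")
    return Q[:, P:].T

X = np.column_stack([np.ones(6), np.arange(6.0)])
A = error_contrasts(X)

# The contrasts A y are unchanged when an arbitrary fixed part X b is added
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1, 5.8])
same = np.allclose(A @ y, A @ (y + X @ np.array([3.0, -2.0])))
```

Since $AX = 0$, the distribution of $Ay$ does not involve $\beta$, which is the sense in which the fixed effects are swept out.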

REML can alternatively be derived from a Bayesian perspective (see
Section 6.11). Specifically, a flat prior is specified for $\beta$, whereas the variance and
covariance parameters are regarded as fixed. The latter parameters can be
estimated using maximum marginal likelihood, after integrating out the $\beta$ and
latent variables. Empirical Bayes (see Section 7.3.1) is employed to ‘score’ the
latent variables and estimate the $\beta$ parameters (e.g. Harville, 1977; Dempster
et al., 1981; Laird and Ware, 1982).

Although developed for linear mixed models with purely continuous
responses, approximate REML methods based on penalized quasi-likelihoods
have also been suggested for generalized linear mixed models (e.g. Schall,
1991; Breslow and Clayton, 1993; McGilchrist, 1994; Stiratelli et al., 1984).
Longford (1993, p.236) suggested another approach, adding a penalty term to
the marginal log-likelihood.

There does not seem to be a clear winner when contrasting the performance
of REML and maximum likelihood (ML). The standard argument in favor of
REML over ML is that unbiased estimators of variance and covariance
parameters are produced. It should be noted, however, that the bias of ML will
be important only if there are few clusters compared to the number of fixed
effects. In this case the utility of latent variable modeling itself may be
questionable, so the performance of REML versus ML becomes a secondary issue.
Furthermore, the mean squared error is often used as optimality criterion
instead of bias. Interestingly, the mean squared error may be larger for REML
(e.g. Corbeil and Searle, 1976), as was also indicated by simulations conducted
by Busing (1993) and van der Leeden and Busing (1994). A disadvantage of
using REML is that deviance testing is precluded for fixed parameters, since
REML itself does not provide estimates of fixed effects. On the other hand,
for the special case of balanced mixed ANOVA models, REML estimates of
variances and covariances are identical to classical ANOVA moment
estimators. It follows that the REML estimator in this particular case has minimal
variance properties and does not rely on any normality assumption. Finally,
it has been argued that REML is less sensitive to outliers than ML (Verbyla,
1993).

6.7 Limited information methods 

In this section we consider models for conditionally (given the latent variables)
multivariate normal latent responses $y_j^*$ with constant cluster size, $n_j = n$,
$j = 1, \ldots, J$. We will consider two-level models here, although higher-level
models are accommodated if the number of level-1 units in the highest-level
units is constant, as in multivariate growth curve models. We furthermore
assume that the latent variables $\zeta$ are multivariate normal so that the marginal
distribution (w.r.t. the latent variables) of the latent responses is multivariate
normal. In this case the reduced form parameters are the parameters
characterizing the marginal mean and covariance structure; see Section 4.9.2. It
follows from multivariate normality that the univariate and bivariate
distributions (marginal w.r.t. the other latent responses) are also normal.

In the limited information approach we first use the univariate and
bivariate distributions to estimate ‘empirical’ or unrestricted versions of the
reduced-form parameters. For instance, in a model without covariates this
would be the unrestricted means and covariances (tetrachoric correlations
in the dichotomous case) of the $y_j^*$, which would generally not satisfy the
model-implied constraints. We then estimate the structural parameters using
a weighted least squares fitting function as in (6.1), minimizing the distance
between the model-implied and estimated (unrestricted) reduced form
parameters. For continuous responses where $y_j^* = y_j$, the univariate and bivariate
distributions contain all information about the reduced form parameters. In
contrast, for coarsened responses (such as dichotomous) this is no longer the
case, giving rise to the term ‘limited information’.

In the present context, the idea of using univariate and bivariate information
appears to be due to Christoffersson (1975) and was extended by Muthén in a series
of papers (e.g. Muthén, 1978, 1981, 1982, 1983, 1984, 1988a, 1989bc). Since
this limited information approach is little known outside psychometrics, we
will provide a somewhat detailed sketch of it.

Consider the model introduced by Muthén (1983, 1984), which generalizes
the structural equation model presented on page 78 to latent responses $y^*$.
The structural model was given in equation (3.33),

$$\eta_j = \alpha + B\eta_j + \Gamma x_j + \zeta_j,$$

and the response model is

$$y_j^* = \nu + \Lambda\eta_j + K x_j + \epsilon_j.$$

For continuous responses, the observed responses simply equal the latent
responses. Dichotomous, ordinal and censored observed responses are related to
the latent responses via threshold functions as described in Section 2.4. The
reduced form becomes

$$y_j^* = \nu + \Lambda(I-B)^{-1}[\alpha + \Gamma x_j + \zeta_j] + K x_j + \epsilon_j$$

with expectation structure

$$
E(y_j^*|x_j) = \underbrace{\nu + \Lambda(I-B)^{-1}\alpha}_{\Pi_0} + \underbrace{[\Lambda(I-B)^{-1}\Gamma + K]}_{\Pi_1}\, x_j,
$$

and covariance structure

$$\Omega = \mathrm{Cov}(y_j^*|x_j) = \Lambda(I-B)^{-1}\Psi(I-B)^{-1\prime}\Lambda' + \Theta.$$

For simplicity we assume from now on that the observed responses are
dichotomous. In this case the diagonal elements of either $\Theta$ or $\Omega$ are
typically fixed to one for identification. Following Muthén, we use the second
parametrization in this section so that the covariance structure becomes

$$\Omega^* = \mathrm{diag}(\Omega)^{-1/2}\, \Omega\, \mathrm{diag}(\Omega)^{-1/2}.$$

We furthermore assume that the thresholds are set to zero for identification.
In order to simplify the subsequent development we define the augmented
covariate vector $z_j = (1, x_j')'$, including a one for the intercepts in addition to
the covariates. The expectation structure can then be written as

$$E(y_j^*|z_j) = \Pi z_j,$$

where $\Pi = (\Pi_0, \Pi_1)$ is the reduced form regression matrix including
intercepts. Following Muthén, we specify multinormal distributions for $\zeta_j$ and $\epsilon_j$,
and obtain

$$y_j^*|z_j \sim N_n(\Pi z_j, \Omega^*).$$

We can write the univariate distribution of a variable or response $i =
1, 2, \ldots, n$, as

$$y_{ij}^*|z_j \sim N(\pi_i z_j, 1),$$

where $\pi_i$ is the $i$th row of $\Pi$. The bivariate distribution of two responses $y_{ij}^*$
and $y_{i'j}^*$, $i' = 1, 2, \ldots, n$, can be written as

$$
\begin{bmatrix} y_{ij}^* \\ y_{i'j}^* \end{bmatrix} \Big| z_j \sim
N_2\left( \begin{bmatrix} \pi_i z_j \\ \pi_{i'} z_j \end{bmatrix},
\begin{bmatrix} 1 & \omega_{ii'}^* \\ \omega_{ii'}^* & 1 \end{bmatrix} \right),
$$

where $\omega_{ii'}^*$ is the $ii'$th residual correlation element in $\Omega^*$.

We now spell out the three-stage limited information approach developed
by Muthén. Note that $\Pi$ and $\Omega^*$ will stand for unrestricted versions of the
reduced form parameters.

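Stage 1: The first estimation stage produces limited information maximum
likelihood estimates for the reduced form intercepts and regression
parameters $\Pi$ from univariate information. A univariate probit regression is
specified for each $i$,

$$\Pr(y_{ij} = 1|z_j) = \Phi_1(\pi_i z_j),$$

where $\Phi_1(\cdot)$ is the standard normal cumulative distribution function. The
log-likelihood contribution for cluster $j$ and variable $i$ becomes

$$\ell_{ij}(\pi_i) = y_{ij} \ln \Phi_1(\pi_i z_j) + (1 - y_{ij}) \ln[1 - \Phi_1(\pi_i z_j)].$$

For each $i$, the univariate log-likelihood $\sum_{j=1}^{J} \ell_{ij}(\pi_i)$ is then maximized,
producing consistent estimates $\hat\pi_i$. The gradients for cluster $j$ are assembled in

$$
g_j^1 = \left[ \frac{\partial \ell_{1j}}{\partial \pi_1}, \frac{\partial \ell_{2j}}{\partial \pi_2}, \ldots, \frac{\partial \ell_{nj}}{\partial \pi_n} \right]'.
$$

Stage 2a: Conditional on the first stage reduced form intercept and
regression parameter estimates $\hat\Pi$, the reduced form residual correlations in $\Omega^*$
are estimated by limited information ‘pseudo’ maximum likelihood based
on bivariate information for each pair of responses $ii'$, $i > i'$:

$$
\begin{aligned}
\Pr(y_{ij} = 1, y_{i'j} = 1|z_j) &= \Phi_2(\pi_i z_j, \pi_{i'} z_j, \omega_{ii'}^*), \\
\Pr(y_{ij} = 1, y_{i'j} = 0|z_j) &= \Phi_2(\pi_i z_j, -\pi_{i'} z_j, -\omega_{ii'}^*), \\
\Pr(y_{ij} = 0, y_{i'j} = 1|z_j) &= \Phi_2(-\pi_i z_j, \pi_{i'} z_j, -\omega_{ii'}^*), \\
\Pr(y_{ij} = 0, y_{i'j} = 0|z_j) &= \Phi_2(-\pi_i z_j, -\pi_{i'} z_j, \omega_{ii'}^*),
\end{aligned}
$$

where $\Phi_2(\mu_1, \mu_2, \rho)$ is the bivariate standard normal cumulative distribution
function with means $\mu_1$ and $\mu_2$ and correlation $\rho$ and $\omega_{ii'}^*$ is the residual
correlation. The corresponding bivariate log-likelihood contribution, given
the estimates $\hat\pi_i$ and $\hat\pi_{i'}$ from stage 1, becomes

$$
\begin{aligned}
\ell_{ii'j}(\omega_{ii'}^*) ={}& y_{ij} y_{i'j} \ln \Phi_2(\hat\pi_i z_j, \hat\pi_{i'} z_j, \omega_{ii'}^*) \\
&+ y_{ij}(1 - y_{i'j}) \ln \Phi_2(\hat\pi_i z_j, -\hat\pi_{i'} z_j, -\omega_{ii'}^*) \\
&+ (1 - y_{ij}) y_{i'j} \ln \Phi_2(-\hat\pi_i z_j, \hat\pi_{i'} z_j, -\omega_{ii'}^*) \\
&+ (1 - y_{ij})(1 - y_{i'j}) \ln \Phi_2(-\hat\pi_i z_j, -\hat\pi_{i'} z_j, \omega_{ii'}^*).
\end{aligned}
$$

For each pair $ii'$, the ‘pseudo log-likelihood’ (in the sense of Parke, 1986)
$\sum_{j=1}^{J} \ell_{ii'j}(\omega_{ii'}^*)$ is then maximized, producing the consistent
estimate $\hat\omega_{ii'}^*$. The gradients for cluster $j$, evaluated at the maximum
pseudo-likelihood, are assembled in

$$
g_j^2 = \left[ \frac{\partial \ell_{21j}}{\partial \omega_{21}^*}, \ldots, \frac{\partial \ell_{n,n-1,j}}{\partial \omega_{n,n-1}^*} \right]'.
$$

For later use, we also define

$$g_j = [g_j^{1\prime}, g_j^{2\prime}]',$$

$$g = \sum_{j=1}^{J} g_j,$$

and put the gradients of the bivariate log-likelihoods with respect to the
reduced form intercepts and regression parameters, evaluated at the
maximum pseudo-likelihood, in the vector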

di 


"Mi 

dizi ’ 


a4 54 <,n-l 

d7Z 2 ’ dl：i 5 d7T 3 ’ ’ 07T n —1 ’ d7T n 


Stage 2b: Let the nonredundant elements of the reduced form parameters 
II and be assembled in the vector cr. The estimated ag^mptotic co- 
variance matrix of the estimated reduced form parameters, Cov ( 左 ) ， is then 
derived based on marginal information from stages 1 and 2a (e.g. Lee, 1982; 
Muthen, 1984). 

Expanding the gradient g(?) of the maximum likelihood estimates around 
the true value a using the mean value theorem gives 

o = g(^) = g(^) + ) (&-&), 

where cr* is some point between a and a. Multiplying by J 吾 ， we obtain 

J5(ct-ct) = (_J -l gg =) ) x J-ig(d-). 


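Regarding the first term, using a law of large numbers gives

$$-J^{-1}\,\frac{\partial \mathbf{g}(\boldsymbol{\sigma}^*)}{\partial \boldsymbol{\sigma}'} \;\xrightarrow{\;p\;}\; \mathbf{A},$$

where $\xrightarrow{p}$ denotes convergence in probability. Note that $\mathbf{A}$ is partitioned as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix},$$

where $\mathbf{A}_{12} = \mathbf{0}$, since the derivatives of the gradients of the univariate
likelihoods with respect to the correlations are zero because no correlations
are involved there.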


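For the second term, using a multivariate central limit theorem gives

$$J^{-1/2} \sum_{j=1}^{J} \mathbf{g}_j(\boldsymbol{\sigma}) \;\xrightarrow{\;d\;}\; N(\mathbf{0}, \mathbf{V}),$$

where $\xrightarrow{d}$ denotes convergence in distribution and

$$\mathbf{V} \equiv \lim_{J\to\infty} \mathrm{Cov}\big(J^{-1/2}\,\mathbf{g}(\boldsymbol{\sigma})\big) = \lim_{J\to\infty} J^{-1} \sum_{j=1}^{J} E\big(\mathbf{g}_j(\boldsymbol{\sigma})\,\mathbf{g}_j(\boldsymbol{\sigma})'\big),$$

since $\sum_{j=1}^{J} E(\mathbf{g}_j(\boldsymbol{\sigma})) = \mathbf{0}$, the expected scores being zero at the maximum.
It follows that

$$\sqrt{J}\,(\hat{\boldsymbol{\sigma}} - \boldsymbol{\sigma}) \;\xrightarrow{\;d\;}\; N(\mathbf{0},\; \mathbf{A}^{-1}\mathbf{V}\mathbf{A}^{-1\prime}).$$

Under correct model specification, it follows from the information matrix
equality that

$$\sum_{j=1}^{J} E\big(\mathbf{g}_j(\boldsymbol{\sigma})\,\mathbf{g}_j(\boldsymbol{\sigma})'\big) = -\sum_{j=1}^{J} E\left(\frac{\partial \mathbf{g}_j(\boldsymbol{\sigma})}{\partial \boldsymbol{\sigma}'}\right).$$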

Using this result, we can estimate the matrices in A using the gradients 
obtained in the previous stages: 

An = 

i=i 

a 22 = 


又 21 = 

㈣ 

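The covariance matrix $\mathbf{V}$ is estimated by the empirical covariance of the
gradients

$$\hat{\mathbf{V}} = J^{-1} \sum_{j=1}^{J} \hat{\mathbf{g}}_j\,\hat{\mathbf{g}}_j'.$$

The asymptotic covariance matrix of the estimated reduced form parameters
is finally estimated as

$$\mathbf{W} = \widehat{\mathrm{Cov}}(\hat{\boldsymbol{\sigma}}) = \hat{\mathbf{A}}^{-1}\,\hat{\mathbf{V}}\,\hat{\mathbf{A}}^{-1\prime}.$$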

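Stage 3: The reduced form parameters are regarded as functions $\boldsymbol{\sigma}(\boldsymbol{\vartheta})$ of the
fundamental parameters $\boldsymbol{\vartheta}$. A consistent estimator $\hat{\boldsymbol{\vartheta}}$ is obtained by fitting
$\boldsymbol{\sigma}(\boldsymbol{\vartheta})$ to $\hat{\boldsymbol{\sigma}}$, minimizing the weighted least squares (WLS) criterion

$$F(\boldsymbol{\vartheta}) = \tfrac{1}{2}\,[\boldsymbol{\sigma}(\boldsymbol{\vartheta}) - \hat{\boldsymbol{\sigma}}]'\,\mathbf{W}^{-1}\,[\boldsymbol{\sigma}(\boldsymbol{\vartheta}) - \hat{\boldsymbol{\sigma}}]. \tag{6.17}$$

Defining

$$\boldsymbol{\Delta} = \frac{\partial \boldsymbol{\sigma}(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'},$$

a model-based estimator of the asymptotic covariance matrix of $\hat{\boldsymbol{\vartheta}}$ is

$$\widehat{\mathrm{Cov}}(\hat{\boldsymbol{\vartheta}}) = J^{-1}\,(\hat{\boldsymbol{\Delta}}'\,\mathbf{W}^{-1}\,\hat{\boldsymbol{\Delta}})^{-1}.$$

This estimator is consistent if the model is correctly specified. A large sample
chi-square distributed test statistic of absolute fit against the estimated
reduced form parameters is obtained as $2J\,F(\hat{\boldsymbol{\vartheta}})$.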

We refer to Küsters (1987) and Muthén and Satorra (1996) for technical
details. Olsson (1979) provides details regarding the first two stages for
polychoric correlations, Olsson et al. (1982) for polyserial correlations and Muthén
(1989c) for tobit correlations. An overview of different latent response
correlations, denoted ‘polytobiserial’ by Küsters (1987), was presented in Table 4.2.

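Importantly, an analogy to robust normal theory estimation (see Satorra
(1990) for a review) was suggested by Muthén (1993) for dichotomous
responses and Muthén et al. (1997) for the general Muthén model. The ‘robust’
asymptotic covariance matrix of $\hat{\boldsymbol{\vartheta}}$ is obtained as

$$\mathrm{Cov}(\hat{\boldsymbol{\vartheta}}) = J^{-1}\,(\boldsymbol{\Delta}'\mathbf{W}^{-1}\boldsymbol{\Delta})^{-1}\,\boldsymbol{\Delta}'\mathbf{W}^{-1}\,\big[\hat{\mathbf{A}}^{-1}\hat{\mathbf{V}}\hat{\mathbf{A}}^{-1\prime}\big]\,\mathbf{W}^{-1}\boldsymbol{\Delta}\,(\boldsymbol{\Delta}'\mathbf{W}^{-1}\boldsymbol{\Delta})^{-1},$$

where the matrix in square brackets is the estimated asymptotic covariance matrix of the reduced form parameters from stage 2b and $\mathbf{W}$ now denotes the weight matrix.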

Muthén (1993) suggested simply using $\mathbf{W} = \mathbf{I}$ in the above expression as well
as in the WLS criterion (6.17), effectively simplifying the latter to unweighted
least squares. Muthén et al. (1997) instead specified $\mathbf{W}$ as a diagonal matrix
with estimated variances of $\hat{\boldsymbol{\sigma}}$ as elements. The resulting $\mathbf{W}$ is then used in
the ‘robust’ covariance matrix and the fit criterion now becomes diagonally
weighted least squares. A beneficial feature of these approaches is that $\mathbf{W}$ need
not be inverted, which can be problematic for ‘large’ models and/or small
samples and/or highly skewed dichotomous responses. Satorra and Bentler
(e.g. Satorra and Bentler, 1994; Satorra, 1992) furthermore propose ‘robust’
tests for absolute fit. Muthén et al. (1997) also discuss the connections to
GEE, a methodology discussed in Section 6.9.

The limited information methodology developed by Muthén and others has
many merits. It handles a general model framework, although only models
with multinormal latent responses. Thus, models with, for instance, a logit link
or Poisson distribution cannot be accommodated. The methodology is very
computationally efficient, reducing a possibly high dimensional integration
problem to a series of univariate and bivariate integrations, which is especially
valuable for models with many latent variables. The approach also appears to
be remarkably efficient, producing estimates that are very close to maximum
likelihood. An important limitation is that missing data can only be handled
by using multiple-group models so that only a few missing data patterns can
be handled in practice. Monte Carlo experiments (e.g. Muthén and Kaplan,
1992) have shown that the method can perform poorly for complex models if
the sample size is small.

6.8 Maximum quasi-likelihood 

We first discuss the iteratively reweighted least squares (IRLS) algorithm for 
generalized linear models and the iterative generalized least squares (IGLS) 
algorithm for multivariate linear models. This is not only of interest in itself 
but also as a precursor for quasi-likelihood, marginal and penalized quasi¬ 
likelihood (MQL and PQL) as well as generalized estimating equations (GEE). 


6.8.1 Iteratively reweighted least squares 

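Consider a generalized linear model (see Section 2.2) with log-likelihood

$$\ell = \sum_i \left\{ [y_i\theta_i - b(\theta_i)]/\phi + c(y_i, \phi) \right\}.$$

The likelihood equations in this case are

$$\frac{\partial \ell}{\partial \beta_p} = \sum_i \frac{1}{\phi}\left[ y_i\,\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \beta_p} - \frac{\partial b(\theta_i)}{\partial \theta_i}\frac{\partial \theta_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial \beta_p} \right] = 0,$$

for $p = 1, \ldots, P$. For the cumulant function $b(\cdot)$,

$$\frac{\partial b(\theta_i)}{\partial \theta_i} = \mu_i, \qquad \frac{\partial \mu_i}{\partial \theta_i} = \frac{\partial^2 b(\theta_i)}{\partial \theta_i^2} = V(\mu_i).$$

Substituting these expressions into the likelihood equations,

$$\frac{\partial \ell}{\partial \beta_p} = \sum_i \frac{\partial \mu_i}{\partial \beta_p}\,\frac{[y_i - \mu_i]}{\phi V(\mu_i)} = 0. \tag{6.18}$$

This can be further simplified using the relations $\mu_i = g^{-1}(\nu_i)$ and $\nu_i = \mathbf{x}_i'\boldsymbol{\beta}$,

$$\frac{\partial \mu_i}{\partial \beta_p} = \frac{\partial \mu_i}{\partial \nu_i}\,x_{pi} = \frac{\partial g^{-1}(\nu_i)}{\partial \nu_i}\,x_{pi} = \frac{x_{pi}}{g'(\mu_i)},$$

where $g'(\mu_i)$ is the first derivative of $g(\cdot)$ evaluated at $\mu_i$.
Substituting these expressions into the likelihood equations,

$$\frac{\partial \ell}{\partial \beta_p} = \sum_i \frac{x_{pi}}{g'(\mu_i)\,\phi V(\mu_i)}\,[y_i - \mu_i] = 0. \tag{6.19}$$

For a linear model (identity link and normal distribution) with possibly
heteroscedastic residual variance $\phi_i = \sigma_i^2$, $V(\mu_i) = 1$, $g'(\mu_i) = 1$, so that the
likelihood equations are linear in $\boldsymbol{\beta}$,

$$\frac{\partial \ell}{\partial \beta_p} = \sum_i \frac{x_{pi}}{\sigma_i^2}\,[y_i - \mathbf{x}_i'\boldsymbol{\beta}] = 0, \tag{6.20}$$

and can be solved by weighted least squares with weights $1/\sigma_i^2$,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}, \tag{6.21}$$

where $\mathbf{V}$ is a diagonal matrix with diagonal elements equal to $\sigma_i^2$.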

Iteratively reweighted least squares (IRLS) is a procedure which linearizes
the likelihood equations in each iteration so that estimates for the next
iteration can be found by weighted least squares. Let the estimates from the
‘current’ iteration be denoted $\boldsymbol{\beta}^k$ and the corresponding mean $\mu_i^k$. A working
variate is defined as

$$z_i^k = g(\mu_i^k) + (y_i - \mu_i^k)\,g'(\mu_i^k),$$

so that, with $\nu_i^k = g(\mu_i^k)$,

$$y_i - \mu_i^k = [z_i^k - \nu_i^k]/g'(\mu_i^k).$$

We now show that the estimates can be updated by weighted least squares
(as if the model were linear) with weights given by $1/\sigma_i^2 = 1/\{[g'(\mu_i^k)]^2\,\phi V(\mu_i^k)\}$.
Substituting these weights into the weighted least squares equations (6.20),
with $z_i^k$ in place of $y_i$,

$$\begin{aligned}
0 &= \sum_i \frac{x_{pi}}{[g'(\mu_i^k)]^2\,\phi V(\mu_i^k)}\,[z_i^k - \mathbf{x}_i'\boldsymbol{\beta}] \\
  &= \sum_i \frac{x_{pi}}{g'(\mu_i^k)\,\phi V(\mu_i^k)}\,\frac{[z_i^k - \mathbf{x}_i'\boldsymbol{\beta}]}{g'(\mu_i^k)} \\
  &= \sum_i \frac{x_{pi}}{g'(\mu_i^k)\,\phi V(\mu_i^k)}\,[y_i - \mu_i],
\end{aligned}$$

we obtain the original likelihood equations (6.19), except that the denominators
of the first terms (the ‘weights’) are held fixed at the estimates from the
previous iteration $k$. Solving these equations by weighted least squares gives
estimates $\boldsymbol{\beta}^{k+1}$, leading to new weights, then new estimates using ‘reweighted’
least squares and so on, iterating until convergence. Note that for generalized
linear models, the iteratively reweighted least squares algorithm is identical
to Fisher scoring.
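The iteration just described can be sketched for a logistic regression, where $\phi = 1$, $V(\mu) = \mu(1-\mu)$ and $g'(\mu) = 1/[\mu(1-\mu)]$, so that the weights reduce to $\mu(1-\mu)$. The sketch below is illustrative only: the names `solve` and `irls_logit` are hypothetical, the starting value of zero is an assumption, and a small Gaussian elimination routine is included so that the example needs nothing beyond the Python standard library.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def irls_logit(X, y, n_iter=25, tol=1e-10):
    """IRLS for logistic regression; X is a list of covariate rows, y a list of 0/1."""
    P = len(X[0])
    beta = [0.0] * P  # assumed starting values
    for _ in range(n_iter):
        A = [[0.0] * P for _ in range(P)]  # X' W X
        b = [0.0] * P                      # X' W z
        for xi, yi in zip(X, y):
            nu = sum(xp * bp for xp, bp in zip(xi, beta))
            mu = 1.0 / (1.0 + math.exp(-nu))
            v = mu * (1.0 - mu)        # V(mu); g'(mu) = 1/v
            z = nu + (yi - mu) / v     # working variate
            w = v                      # weight 1 / ([g'(mu)]^2 * phi * V(mu))
            for p in range(P):
                b[p] += w * xi[p] * z
                for q in range(P):
                    A[p][q] += w * xi[p] * xi[q]
        new_beta = solve(A, b)         # weighted least squares step
        converged = max(abs(u - v_) for u, v_ in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

For a single dichotomous covariate the result can be checked against the closed-form estimates, since the fitted probabilities must then equal the observed proportions in the two groups.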

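Another way of conceptualizing the algorithm is by approximating the
generalized linear model by a linear model using a first order Taylor series
expansion. Let $h(\nu_i) = g^{-1}(\nu_i) = \mu_i$ and $h'(\nu_i)$ be the first derivative evaluated at
$\nu_i$. In the $k$th iteration, $y_i$ is approximated as

$$y_i = h(\nu_i^k) + \mathbf{x}_i'(\boldsymbol{\beta} - \boldsymbol{\beta}^k)\,h'(\nu_i^k) + \epsilon_i,$$

where $\mathrm{Var}(\epsilon_i) = \phi V(\mu_i)$. Rearranging terms,

$$y_i - h(\nu_i^k) + \mathbf{x}_i'\boldsymbol{\beta}^k\,h'(\nu_i^k) = \mathbf{x}_i'\boldsymbol{\beta}\,h'(\nu_i^k) + \epsilon_i.$$

Multiplying by $1/h'(\nu_i^k) = g'(\mu_i^k)$, we obtain

$$[y_i - \mu_i^k]\,g'(\mu_i^k) + \nu_i^k = z_i^k = \mathbf{x}_i'\boldsymbol{\beta} + \epsilon_i\,g'(\mu_i^k),$$

a linear model for $z_i^k$ with mean $\mathbf{x}_i'\boldsymbol{\beta}$ and variance $[g'(\mu_i^k)]^2\,\phi V(\mu_i^k)$.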
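As a worked illustration of the working variate and its variance, consider two common links. These are standard special cases written out under the assumption that $\phi = 1$; they are not taken from the text.

```latex
% Logit link with Bernoulli responses: g(\mu) = \ln\{\mu/(1-\mu)\}, V(\mu) = \mu(1-\mu)
g'(\mu_i^k) = \frac{1}{\mu_i^k(1-\mu_i^k)}, \qquad
z_i^k = \nu_i^k + \frac{y_i - \mu_i^k}{\mu_i^k(1-\mu_i^k)}, \qquad
[g'(\mu_i^k)]^2\,\phi V(\mu_i^k) = \frac{1}{\mu_i^k(1-\mu_i^k)}.

% Log link with Poisson responses: g(\mu) = \ln\mu, V(\mu) = \mu
g'(\mu_i^k) = \frac{1}{\mu_i^k}, \qquad
z_i^k = \nu_i^k + \frac{y_i - \mu_i^k}{\mu_i^k}, \qquad
[g'(\mu_i^k)]^2\,\phi V(\mu_i^k) = \frac{1}{\mu_i^k}.
```

In both cases the weighted least squares weights are the reciprocals of these variances, that is $\mu_i^k(1-\mu_i^k)$ and $\mu_i^k$ respectively.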

Note that the details of the algorithm depend only on the link and variance 
functions of the generalized linear model. If we wish to specify an arbitrary 
link and variance function, we can employ the same algorithm to estimate 
the parameters even if the specification does not correspond to any statistical 
model. This idea of estimating parameters without specifying a model is known 
as quasi-likelihood (Wedderburn, 1974) and the corresponding equations as 
quasi-score equations or estimating equations. McCullagh (1983) showed that 
quasi-likelihood estimators have similar properties to maximum likelihood es¬ 
timators such as consistency and asymptotic normality with covariance matrix 
given by the same formula as for maximum likelihood. 


6.8.2 Iterative generalized least squares 

In multivariate linear models with known residual covariance matrix $\mathbf{V}$,
the parameters can be estimated by generalized least squares (GLS) where the
diagonal matrix in (6.21) is replaced by the (nondiagonal) covariance matrix.
Since the residual covariance matrix is generally not known, we must use
iterative methods such as iterative generalized least squares (IGLS). Writing
a multilevel linear mixed model for the response vector of the entire sample
as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\Lambda}\boldsymbol{\zeta} + \boldsymbol{\epsilon},$$

and letting $\mathbf{V}^k$ be the ‘current’ estimate of the covariance matrix of the
total residual $\boldsymbol{\xi} = \boldsymbol{\Lambda}\boldsymbol{\zeta} + \boldsymbol{\epsilon}$, the regression parameters can be updated
using GLS

$$\boldsymbol{\beta}^{k+1} = \big(\mathbf{X}'(\mathbf{V}^k)^{-1}\mathbf{X}\big)^{-1}\,\mathbf{X}'(\mathbf{V}^k)^{-1}\mathbf{y}. \tag{6.22}$$

Using these updated estimates, the variance parameters are estimated
from the residuals $\mathbf{r}^{k+1} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}^{k+1}$, giving a new estimate of the covariance
matrix $\mathbf{V}^{k+1}$. Specifically, the matrix of cross-products $\mathbf{r}^{k+1}\mathbf{r}^{k+1\prime}$ is formed
and its expectation equated to $\mathbf{V}^{k+1}$. The expectation of the vectorized
matrix of cross-products can be written as a linear regression with variance
parameters as coefficients. For instance, for a two-level random intercept model
(omitting the superscripts),


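$$E[\mathrm{vec}(\mathbf{r}\mathbf{r}')] = E\begin{bmatrix} r_{11}^2 \\ r_{21}r_{11} \\ r_{21}^2 \\ r_{12}r_{11} \\ r_{12}r_{21} \\ \vdots \\ r_{n_J J}^2 \end{bmatrix}
= \begin{bmatrix} \psi + \theta \\ \psi \\ \psi + \theta \\ 0 \\ 0 \\ \vdots \\ \psi + \theta \end{bmatrix}
= \psi \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}
+ \theta \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.$$

The variance and covariance parameters are then estimated by generalized
least squares where the covariance matrix of $\mathrm{vec}(\mathbf{r}\mathbf{r}')$ is derived from the
previous estimate $\mathbf{V}^k$.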

The IGLS algorithm hence iterates between updating the parameters of
the fixed and random parts of the model. The resulting estimates are
maximum likelihood estimates under normality of $\boldsymbol{\zeta}$ and $\boldsymbol{\epsilon}$ (Goldstein, 1986).
Goldstein (1986) shows how inversion of the very high-dimensional covariance
matrix can be simplified, exploiting its block diagonal structure.
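A minimal sketch of the IGLS iteration for the simplest case, a two-level random intercept model with only a fixed intercept, is given below. It is an illustration under stated assumptions rather than Goldstein's implementation: the function name `igls_random_intercept` is hypothetical, $\psi$ and $\theta$ denote the random intercept and level-1 variances, and the block diagonal structure is exploited through the closed-form inverse of a compound symmetric block, whose eigenvalues are $1/(\theta + n_j\psi)$ (once) and $1/\theta$ ($n_j - 1$ times). Cross-products from different clusters have zero rows in the design matrix and therefore drop out of the normal equations.

```python
def igls_random_intercept(clusters, n_iter=50, tol=1e-10):
    """IGLS for y_ij = beta + zeta_j + e_ij; clusters is a list of lists of responses.

    Returns (beta, psi, theta). Assumes every cluster has at least one unit and
    that theta stays positive during the iterations.
    """
    all_y = [y for c in clusters for y in c]
    beta = sum(all_y) / len(all_y)                      # OLS start
    psi = 0.0
    theta = sum((y - beta) ** 2 for y in all_y) / len(all_y)
    for _ in range(n_iter):
        # Fixed part: GLS step, using V_j^{-1} 1 = c_j 1 with c_j = 1/(theta + n_j psi)
        num = den = 0.0
        for c in clusters:
            cj = 1.0 / (theta + len(c) * psi)
            num += cj * sum(c)
            den += cj * len(c)
        beta_new = num / den
        # Random part: GLS regression of vec(r r') on vec(1 1') and vec(I),
        # weighted by V^{-1} (x) V^{-1} evaluated at the previous psi and theta
        a11 = a12 = a22 = b1 = b2 = 0.0
        for c in clusters:
            n = len(c)
            cj = 1.0 / (theta + n * psi)
            r = [y - beta_new for y in c]
            s = sum(r)
            ssw = sum(ri * ri for ri in r) - s * s / n  # within-cluster sum of squares
            a11 += (n * cj) ** 2                        # tr(V^-1 11' V^-1 11')
            a12 += n * cj * cj                          # tr(V^-1 11' V^-1)
            a22 += (n - 1) / theta ** 2 + cj * cj       # tr(V^-1 V^-1)
            b1 += (cj * s) ** 2                         # r' V^-1 11' V^-1 r
            b2 += cj * cj * s * s / n + ssw / theta ** 2  # r' V^-1 V^-1 r
        det = a11 * a22 - a12 * a12
        psi_new = (a22 * b1 - a12 * b2) / det
        theta_new = (a11 * b2 - a12 * b1) / det
        change = max(abs(beta_new - beta), abs(psi_new - psi), abs(theta_new - theta))
        beta, psi, theta = beta_new, psi_new, theta_new
        if change < tol:
            break
    return beta, psi, theta
```

For balanced data the fixed point can be compared with the closed-form maximum likelihood estimates of the one-way random effects model, which offers a simple check of the sketch.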

After convergence, the standard errors of the estimated regression
parameters are estimated from the last GLS step treating the covariance matrix $\mathbf{V}$
as known. These standard errors are generally correct since the estimates of
the fixed part are uncorrelated with the estimates of the random part. An
important exception is the situation where responses are missing at random 
with missingness depending on observed responses, not just covariates. Con¬ 
sider for instance a linear random intercept model for longitudinal data with¬ 
out covariates. If the probability of dropout increases with the magnitude of 
the observed response prior to dropout, a larger random intercept variance 
(i.e. a higher intraclass correlation) would imply a larger fixed intercept (since 
the imputed values for those who dropped out would be higher), making the 
two estimates positively correlated. See also Verbeke and Molenberghs (2000, 
Chapter 21). 

For details of the IGLS algorithm we refer to Chapter 2 and Appendix 
2.1 in Goldstein (2003). A slight modification of IGLS to restricted iterative 
generalized least squares (RIGLS) leads to restricted maximum likelihood 
(REML) estimates (Goldstein, 1989). Yang et al. (1999) propose an
extension of the IGLS algorithm for multilevel structural equation models (see also
Rabe-Hesketh et al., 2001d).


6.8.3 Marginal and penalized quasi-likelihood 

Marginal quasi-likelihood (MQL) and penalized quasi-likelihood (PQL) have 
been derived in a number of ways (see Section 6.3.1 and McCulloch and Searle, 
2001). Here we give a summary based on the description in Goldstein (2003). 

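MQL and PQL are based on approximating generalized linear mixed models
by linear mixed models so that the IGLS algorithm can be applied (which
no longer corresponds to maximum likelihood). This linearization method is
analogous to iteratively reweighted least squares described in Section 6.8.1.

In generalized linear mixed models the conditional expectation of the
response is (retaining only the $i$ subscript for level 1 units) $\mu_i = h(\nu_i)$, where
$h(\cdot)$ is the inverse link function. The model for $y_i$ is linearized by expanding
$h(\nu_i)$ as a first order Taylor series around a known ‘current’ value of the linear
predictor from iteration $k$,

$$\nu_i^k = \mathbf{x}_i'\boldsymbol{\beta}^k + \sum_l \mathbf{z}_i^{(l)\prime}\boldsymbol{\zeta}^{(l)k}, \tag{6.23}$$

giving

$$y_i \approx h(\nu_i^k) + \mathbf{x}_i'(\boldsymbol{\beta} - \boldsymbol{\beta}^k)\,h'(\nu_i^k) + \sum_l \mathbf{z}_i^{(l)\prime}(\boldsymbol{\zeta}^{(l)} - \boldsymbol{\zeta}^{(l)k})\,h'(\nu_i^k) + \epsilon_i. \tag{6.24}$$

Here $\epsilon_i$ is a heteroscedastic error term with variance $\phi V(\mu_i)$ corresponding
to the chosen distribution family. Note that this expression is linear in the
unknown parameters $\boldsymbol{\beta}$. The sum of the terms involving known current values
$\boldsymbol{\beta}^k$ and $\boldsymbol{\zeta}^{(l)k}$ is treated as an offset

$$o_i = h(\nu_i^k) - h'(\nu_i^k)\Big[\mathbf{x}_i'\boldsymbol{\beta}^k + \sum_l \mathbf{z}_i^{(l)\prime}\boldsymbol{\zeta}^{(l)k}\Big]$$

and the terms involving latent variables $\boldsymbol{\zeta}^{(l)}$ contribute to the total residual

$$\xi_i = h'(\nu_i^k)\sum_l \mathbf{z}_i^{(l)\prime}\boldsymbol{\zeta}^{(l)} + \epsilon_i,$$

giving

$$y_i = o_i + h'(\nu_i^k)\,\mathbf{x}_i'\boldsymbol{\beta} + \xi_i.$$

Multiplying by $1/h'(\nu_i^k)$, we can therefore obtain $\boldsymbol{\beta}^{k+1}$ using generalized least
squares as in (6.22) where $\mathbf{V}^k$ is the covariance matrix of the total residuals
for all units for the current estimates.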

There are several variants of this algorithm (Goldstein, 1991, 2003;
Longford, 1993, 1994). In marginal quasi-likelihood (MQL), $\boldsymbol{\zeta}^{(l)k}$ is set to zero
in (6.23) and (6.24). In penalized quasi-likelihood (PQL), the expansion is
improved by setting the latent variables equal to the posterior modes based
on the linearized model, $\boldsymbol{\zeta}^{(l)k} = \tilde{\boldsymbol{\zeta}}^{(l)k}$. The difference between MQL and PQL
is hence in the offset used. Since MQL sets the latent variables to zero, the
fixed effects estimates are essentially marginal effects. These are attenuated
relative to the required conditional effects as discussed in Section 4.8.1.
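The attenuation of marginal relative to conditional effects can be made concrete numerically. The sketch below is an illustration under stated assumptions, not part of the MQL algorithm: it takes a random intercept logistic model with a normally distributed intercept of variance `psi`, averages the conditional probability over that distribution with a trapezoidal rule, and reports the implied effect on the marginal log-odds scale. The function names and the integration settings are hypothetical.

```python
import math


def marginal_prob(eta, psi, n_nodes=2001, width=8.0):
    """Pr(y = 1) averaged over zeta ~ N(0, psi) when logit Pr(y = 1 | zeta) = eta + zeta.

    Trapezoidal rule over the standardized intercept u on [-width, width].
    """
    sd = math.sqrt(psi)
    h = 2.0 * width / (n_nodes - 1)
    total = 0.0
    for k in range(n_nodes):
        u = -width + k * h
        w = 0.5 if k in (0, n_nodes - 1) else 1.0
        dens = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += w * dens / (1.0 + math.exp(-(eta + sd * u)))
    return total * h


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_effect(beta0, beta1, psi):
    """Difference in marginal log-odds when the covariate changes from 0 to 1."""
    return logit(marginal_prob(beta0 + beta1, psi)) - logit(marginal_prob(beta0, psi))
```

For example, `marginal_effect(0.0, 1.0, 4.0)` is noticeably smaller than the conditional effect of 1, and the gap widens as `psi` grows.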

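The estimator for the latent variables is more easily expressed in terms of
the ‘working variate’

$$z_i = (y_i - o_i)/h'(\nu_i),$$

so that the linear approximation can be written using the GF formulation (see
Section 4.2.2) as

$$\mathbf{z}_{(L)} = \mathbf{X}_{(L)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{(L)}\boldsymbol{\zeta}_{(L)} + \mathbf{e}_{(L)},$$

where $\mathbf{e}_{(L)}$ collects the rescaled level-1 errors $\epsilon_i/h'(\nu_i)$, and the latent variables are updated using

$$\tilde{\boldsymbol{\zeta}}_{(L)}^{k+1} = \boldsymbol{\Psi}_{(L)}\boldsymbol{\Lambda}_{(L)}'\mathbf{V}_{(L)}^{-1}\,(\mathbf{z}_{(L)} - \mathbf{X}_{(L)}\boldsymbol{\beta}^{k+1}),$$

where $\boldsymbol{\Psi}_{(L)}$ is the covariance matrix of $\boldsymbol{\zeta}_{(L)}$ and $\mathbf{V}_{(L)}$ is a block of $\mathbf{V}^k$ for a top-level unit. The parameters of the
random part are estimated from the residuals $r_i = y_i - h(\nu_i)$ as in Section 6.8.2.
The algorithm therefore iterates between updating $\boldsymbol{\beta}$ for given variance parameters and latent variables,
and updating the variance parameters and $\boldsymbol{\zeta}_{(L)}$ given current values of the other parameters.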

The algorithms have been improved considerably by using a second order
Taylor expansion in the latent variables (Goldstein and Rasbash, 1996). In
the case of PQL, this improves both the offset and the variance of the total
residual. In the case of MQL, the offset is not affected since the random part
is set to zero. For more details see Appendix 4.1 in Goldstein (2003).

The PQL approach has been used for generalized linear mixed models by
Goldstein (1991), Schall (1991), Breslow and Clayton (1993), Longford (1993),
Wolfinger and O’Connell (1993), Engel and Keen (1994) and McGilchrist
(1994) among others. The algorithm is computationally very efficient since
numerical integration is avoided. Furthermore, the approach can be used for
models with crossed random effects (e.g. Breslow and Clayton, 1993; and, in
linear mixed models, Goldstein, 1987), and (spatially or temporally)
autocorrelated random effects (e.g. Breslow and Clayton, 1993; Langford et al., 1999),
as well as multiple membership models (e.g. Hill and Goldstein, 1998; Rasbash
and Browne, 2001); see also Section 3.2.6. See Section 6.11.5 for an alternative
approach to the analysis of models with crossed random effects. However, PQL
has not been used for models with factor loadings or structural equations.

These methods work well when the conditional distribution of the responses 
given the random effects is close to normal, for example with a Poisson distri¬ 
bution if the mean is 5 or greater (Breslow, 2003), or 7 or greater (McCulloch 
and Searle, 2001) or if the responses are proportions with large binomial de¬ 
nominators. The methods also work well if the conditional joint distributions 
of the responses belonging to each cluster are nearly normal or, equivalently, 
if the posterior distribution of the random effects is nearly normal. Even for 
dichotomous responses, this becomes increasingly the case as the cluster sizes 
increase. 

However, both MQL and PQL perform poorly for dichotomous responses
with small cluster sizes (e.g. Rodríguez and Goldman, 1995, 2001; Breslow
and Lin, 1995; Lin and Breslow, 1996; Breslow et al., 1998; Goldstein and
Rasbash, 1996; Browne and Draper, 2004; McCulloch and Searle, 2001;
Breslow, 2003). In such situations, PQL is a better approximation than MQL and
second order expansions of the random part (MQL-2 or PQL-2) yield better
results than first order expansions (MQL-1 or PQL-1). However, Rodríguez
and Goldman (2001) found that estimates of both fixed and random
parameters were attenuated even for PQL-2, in contrast to ML and Gibbs sampling,
for ‘large’ random effects variances. They point out that it is hazardous to
use methods that work well only for ‘small’ random effects variances since the
degree of within-cluster dependence is rarely known in advance. Unfortunately,
MQL-2 and PQL-2 are sometimes numerically unstable, a problem reported
by Rodríguez and Goldman (2001) for one of their examples.

The standard errors for $\hat{\boldsymbol{\beta}}$ do not take into account the imprecision in the
estimates of the variance parameters. This can result in large biases since the fixed effects estimates
are generally correlated with the variance estimates in generalized linear mixed
models. This is obvious in PQL where the offset used to estimate the fixed
effects depends on the variance. Another drawback of marginal and penalized
quasi-likelihood is that no likelihood is provided, precluding for instance the
use of likelihood ratio testing, model selection criteria such as AIC and BIC,
and likelihood based diagnostics (see Chapter 8).

6.9 Generalized Estimating Equations (GEE) 

Estimation using ‘Generalized Estimating Equations’ (GEE) was initially
advocated in a series of papers by Liang, Zeger and their colleagues (see Liang
and Zeger, 1986; Zeger and Liang, 1986; Zeger et al., 1988). GEE can be
considered as a generalization of the quasi-likelihood method described in
Section 6.8.1 to multivariate regression models. This methodology is popular for
dependent responses, for instance repeated measurements or clustered data.

The marginal expectation (with respect to any latent variables) of the
responses is modeled using a generalized linear model. For a two-level model
with $n_j$ observations for cluster $j$,

$$g[E(y_{ij} \mid \mathbf{x}_{ij})] = \mathbf{x}_{ij}'\boldsymbol{\beta}.$$

The regression parameters $\boldsymbol{\beta}$ then represent marginal or population averaged
effects. Importantly, these effects differ from conditional or cluster-specific
effects (see Figure 1.6 and Section 3.6.5).

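We now consider the marginal variances and covariances of the responses
given the covariates. The variances are assumed to be $\phi V(\mu_{ij})$, corresponding
to the specified generalized linear model (see Section 2.2). Combining these
variances with a working correlation matrix $\mathbf{R}_j(\boldsymbol{\alpha})$ structured by parameters
$\boldsymbol{\alpha}$, the covariance matrix becomes

$$\mathbf{V}_j = \phi\,\mathbf{B}_j^{1/2}\,\mathbf{R}_j(\boldsymbol{\alpha})\,\mathbf{B}_j^{1/2},$$

where $\mathbf{B}_j$ is a diagonal matrix with elements $b''(\theta_{ij}) = V(\mu_{ij})$.

The quasi-score equation in (6.18) can now be generalized to generalized
estimating equations of the form

$$\mathbf{S}_{\beta}(\boldsymbol{\beta}, \boldsymbol{\alpha}) = \sum_j \left(\frac{\partial \boldsymbol{\mu}_j}{\partial \boldsymbol{\beta}'}\right)' \mathbf{V}_j^{-1}\,(\mathbf{y}_j - \boldsymbol{\mu}_j) = \mathbf{0},$$

which depend not only on the marginal effects $\boldsymbol{\beta}$ but also on the dependence
parameters $\boldsymbol{\alpha}$. Here,

$$\frac{\partial \boldsymbol{\mu}_j}{\partial \boldsymbol{\beta}'} = \mathbf{B}_j\,\boldsymbol{\Delta}_j\,\mathbf{X}_j,$$

where $\boldsymbol{\Delta}_j$ is a diagonal matrix with elements $\partial\theta_{ij}/\partial\nu_{ij}$.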
Liang and Zeger (1986) propose iterating between (1) estimation of $\boldsymbol{\beta}$ (for
given $\boldsymbol{\alpha}$ and $\phi$) solving the generalized estimating equations and (2) estimation
of $\boldsymbol{\alpha}$ and $\phi$ (for given $\boldsymbol{\beta}$) based on Pearson residuals $r_{ij}$,

= [Vij -校爲辦轉勝 

They propose a moment estimator for the overdispersion parameter $\phi$,




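and the following moment estimators for $\boldsymbol{\alpha}$ for different correlation structures
$\mathbf{R}_j(\boldsymbol{\alpha})$:

• ‘Independence’

- Correlation structure: $\mathrm{Cor}(y_{ij}, y_{i'j}) = 0$.

• ‘Exchangeable’

- Correlation structure: $\mathrm{Cor}(y_{ij}, y_{i'j}) = \alpha$,

- Estimator: $\hat{\alpha} = \frac{1}{J}\sum_{j=1}^{J} \frac{1}{n_j(n_j-1)} \sum_{i \neq i'} r_{ij}\,r_{i'j}$,

• ‘AR(1)’

- Correlation structure: $\mathrm{Cor}(y_{ij}, y_{i+t,j}) = \alpha^t$, $t = 0, 1, \ldots, n_j - i$,

- Estimator: $\hat{\alpha} = \frac{1}{J}\sum_{j=1}^{J} \frac{1}{n_j - 1} \sum_{i \leq n_j - 1} r_{ij}\,r_{i+1,j}$,

• ‘Unstructured’

- Correlation structure: $\mathrm{Cor}(y_{ij}, y_{i'j}) = \alpha_{ii'}$,

- Estimator: $\hat{\alpha}_{ii'} = \frac{1}{J}\sum_{j=1}^{J} r_{ij}\,r_{i'j}$.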
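The moment estimators in the list above are simple averages of products of residuals, as the following sketch shows for the exchangeable and AR(1) structures. It assumes that the Pearson residuals passed in have already been standardized to unit variance, and it skips clusters with a single observation, which is a choice made here for illustration; the function names are hypothetical.

```python
def exchangeable_alpha(residuals):
    """Moment estimator of a common correlation from standardized Pearson residuals.

    residuals: list of clusters, each a list of residuals r_ij.
    Uses sum_{i != i'} r_i r_i' = (sum_i r_i)^2 - sum_i r_i^2 within each cluster.
    """
    total, n_clusters = 0.0, 0
    for r in residuals:
        n = len(r)
        if n < 2:
            continue  # assumption: singleton clusters carry no information on alpha
        s = sum(r)
        ss = sum(v * v for v in r)
        total += (s * s - ss) / (n * (n - 1))
        n_clusters += 1
    return total / n_clusters


def ar1_alpha(residuals):
    """Moment estimator of the lag-1 correlation from products of adjacent residuals."""
    total, n_clusters = 0.0, 0
    for r in residuals:
        n = len(r)
        if n < 2:
            continue
        total += sum(r[i] * r[i + 1] for i in range(n - 1)) / (n - 1)
        n_clusters += 1
    return total / n_clusters
```

In a full GEE fit these estimates would be recomputed after each update of the regression parameters, since the residuals depend on the current fitted means.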

Liang and Zeger (1986) showed that the estimated marginal effects $\hat{\boldsymbol{\beta}}$ are
asymptotically normal and consistent as the number of clusters increases.
Importantly, these estimates are ‘robust’ in the sense that they are consistent for
misspecified correlation structures, assuming that the mean structure is
correctly specified. Consistent estimates of the covariance matrix of the estimated
marginal effects are typically obtained by means of the so-called sandwich
estimator described in Section 8.3.3.

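Instead of using the above moment estimators of $\boldsymbol{\alpha}$, Prentice (1988)
suggested adding a second set of estimating equations for the correlation
parameters $\boldsymbol{\alpha}$ (and possibly $\phi$). Define the vector of products of Pearson
residuals $\mathbf{u}_j = (r_{1j}r_{2j},\, r_{1j}r_{3j},\, \ldots,\, r_{n_j-1,j}\,r_{n_j j})'$ and the diagonal matrix
$\mathbf{W}_j = \mathrm{diag}\{\mathrm{Var}(r_{1j}r_{2j}),\, \mathrm{Var}(r_{1j}r_{3j}),\, \ldots,\, \mathrm{Var}(r_{n_j-1,j}\,r_{n_j j})\}$. The estimating equations
for $\boldsymbol{\alpha}$ can then be expressed as

$$\mathbf{S}_{\alpha}(\boldsymbol{\beta}, \boldsymbol{\alpha}) = \sum_j \left(\frac{\partial E(\mathbf{u}_j)}{\partial \boldsymbol{\alpha}'}\right)' \mathbf{W}_j^{-1}\,[\mathbf{u}_j - E(\mathbf{u}_j)] = \mathbf{0}.$$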






These approaches based on Pearson correlations are problematic for
categorical responses where the Pearson correlation is in general not a suitable
measure of association. For the special case of dichotomous responses this
makes some sense, but a problem is that the admissible range of the
correlation depends on the marginal probabilities (e.g. Lord and Novick, 1968;
Bishop et al., 1975). A more natural measure of association for categorical
data is the odds-ratio, and a parametrization based on marginal odds-ratios
was proposed by Lipsitz et al. (1991). The odds-ratios are typically structured
to simplify the working correlation matrix, for instance by specifying a
common odds-ratio. Log-linear models may also be specified letting the odds-ratio
depend on covariates. In general, the specification of the working correlation
matrix entails a trade-off between simplicity and loss of efficiency due to
misspecification (e.g. Fitzmaurice, 1995).

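Zhao and Prentice (1990) and Liang et al. (1992) proposed extending the
above first order estimating equations (GEE-1) to second order estimating
equations (GEE-2). Here, a joint estimating equation

$$\sum_j \begin{pmatrix} \dfrac{\partial \boldsymbol{\mu}_j}{\partial \boldsymbol{\beta}'} & \mathbf{0} \\[2mm] \dfrac{\partial E(\mathbf{u}_j)}{\partial \boldsymbol{\beta}'} & \dfrac{\partial E(\mathbf{u}_j)}{\partial \boldsymbol{\alpha}'} \end{pmatrix}' \begin{pmatrix} \mathbf{V}_j & \mathrm{Cov}(\mathbf{y}_j, \mathbf{u}_j) \\ \mathrm{Cov}(\mathbf{u}_j, \mathbf{y}_j) & \mathbf{W}_j \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{y}_j - \boldsymbol{\mu}_j \\ \mathbf{u}_j - E(\mathbf{u}_j) \end{pmatrix} = \mathbf{0}$$

is simultaneously solved for $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$. The major merit of GEE-2 as compared
to GEE-1 is in efficiency gain, primarily for $\boldsymbol{\alpha}$. However, the robustness of
GEE-1 is lost since GEE-2 rests on correct specification of the dependence
structure. Moreover, obtaining the required estimate of $\mathrm{Cov}(\mathbf{u}_j, \mathbf{y}_j)$ is quite
involved.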

When marginal odds-ratios are used to represent dependence, Carey et
al. (1993) suggest estimating $\boldsymbol{\alpha}$ using logistic regression with an offset. This
implementation of GEE is called ‘alternating logistic regressions’ (ALR). ALR
preserves the robustness of GEE-1 regarding $\boldsymbol{\beta}$ while producing estimates of $\boldsymbol{\alpha}$
that are almost as efficient as using the more complex GEE-2 methodology.

Since explicit integration is avoided, the GEE methodology is definitely 
an important contribution to the estimation of models for longitudinal and 
clustered data. We use GEE for longitudinal data on respiratory infection in 
Section 9.2 where it is also compared to random effects modeling. Interest¬ 
ingly, GEE has recently been extended to factor models (Reboussin and Liang, 
1998), where the dependence structure is of primary interest. 

A rather severe limitation is that missing data can apparently only be
handled under the restrictive assumption of missing completely at random (MCAR)
(Liang and Zeger, 1986), since the estimating equations will otherwise be
biased (e.g. Rotnitzky and Wypij, 1994). However, it is often not recognized that
missingness may actually depend on covariates but not on observed responses
(Little, 1995). Robins et al. (1994) suggest combining estimating equations
with inverse probability weighting, yielding consistent estimators if the
missing data mechanism is correctly specified.

Another limitation is that it is in general difficult to assess model adequacy
in GEE (e.g. Albert, 1999); likelihood based diagnostics are for instance not
available. The use of GEE should furthermore be reserved for problems where
marginal or population averaged effects are of interest and avoided in
analyses of etiology. This is because causal processes must operate at the cluster
or individual level, not the population level. Population averaged effects are
therefore merely descriptive and largely determined by the degree of
heterogeneity in the population. Finally, Lindsey and Lambert (1998) and Crouchley
and Davies (1999) point out that the estimated regression parameters are no
longer consistent if there are endogenous covariates such as ‘baseline’ (initial)
responses in longitudinal data.


6.10 Fixed effects methods 

The main focus in this book, including the discussion previously in this chap¬ 
ter, has been on latent variables as random variables. In this section, we depart 
from this interpretation and instead consider latent variables as unknown fixed 
parameters. 


6.10.1 Joint maximum likelihood 

At first sight, it may appear natural to attempt the simultaneous estimation
of the fundamental parameters $\boldsymbol{\vartheta}$ and the ‘latent scores’ or values attained by
the latent variables $\boldsymbol{\zeta}$ by maximizing the likelihood of the responses given the
latent variables. For the two-level random intercept model, the log-likelihood

$$\ln \prod_{ij} g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j; \boldsymbol{\beta})$$

is maximized with respect to both $\boldsymbol{\beta}$ and the $\zeta_j$, $j = 1, \ldots, J$. Thus, the random
intercepts are simply treated as fixed parameters and estimated alongside $\boldsymbol{\beta}$.
Note that this likelihood differs from the $h$-likelihood in equation (6.7) on
page 164, the joint likelihood of the responses and latent variables, which is
also jointly maximized with respect to both parameters and latent scores.

In a three-level random intercept model estimation of the level-three in¬ 
tercepts would require constraints on the level-2 intercepts, for instance that 
they add to zero for each level-3 cluster. It is therefore more natural to omit 
all higher-level latent variables that have lower-level counterparts. Hence we 
assume that the model only includes latent variables at level 2. 

A fundamental problem may arise due to the fact that the number of latent
scores to be estimated increases with the number of level-2 clusters $J$. This is
the well-known problem of estimating ‘structural parameters’ in the presence
of ‘incidental parameters’. In our context the fundamental parameters are
regarded as structural and the latent scores as incidental. The basic problem
is that maximum likelihood estimators of the structural parameters are not
necessarily consistent when incidental parameters are present (Neyman and
Scott, 1948; see also Lancaster, 2000). Zellner (1971, p. 114-115) demonstrates
how inconsistency comes about in a simple example.
Importantly, there is no incidental parameter problem for linear models with
normally distributed residuals or loglinear models with a Poisson distribution.
In these models the estimates coincide with those from conditional maximum
likelihood to be described in Section 6.10.2. See Cameron and Trivedi (1998)
for further discussion.

For common factor models with continuous responses, Anderson and Rubin
(1956) demonstrated that the likelihood presented by Lawley (1942) did not
have a maximum. As a solution to this problem McDonald (1979) suggested
using a maximum likelihood ratio (MLR) estimator, where the alternative is
that $\boldsymbol{\Psi}$ is any positive definite matrix. It turns out that the resulting estimates
of the structural parameters equal those obtained by ML in the random factor
case, whereas the estimators for the factor scores are inconsistent, being given
by the Kestelman expressions for ‘indeterminate’ factor scores (Kestelman,
1952; Guttman, 1955).

Turning to dichotomous responses, the ramifications for the logistic
regression model with cluster-specific intercepts, $n = 2$ level-1 units and one
dichotomous covariate were established by Andersen (1973) (see also Chamberlain,
1980 and Breslow and Day, 1980). Specifically, Andersen demonstrated that
the joint maximum likelihood estimator is inconsistent for $J \to \infty$, since $\hat{\beta}$
converges in probability to $2\beta$. Simulations performed by Katz (2001) suggest
that joint maximum likelihood is safe if $n_j = n > 20$ and might be acceptable
if $8 < n < 16$. For the Rasch model, simulations indicate that $\hat{\beta}$ converges to
$\frac{n}{n-1}\beta$, where $n$ is the number of items. Haberman (1977) proved that the bias
vanishes if J —^ oc, n —> oo and ^ — oo. For fixed effects probit regression 
Heckman (1981b) conducted a Monte Carlo experiment and found modest 
bias, always towards zero. 
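Andersen's result for pairs can be reproduced numerically. The sketch below assumes a design in which the dichotomous covariate is 0 for the first unit and 1 for the second unit of every pair, so that only the counts of discordant pairs, `n01` (responses 0 then 1) and `n10` (responses 1 then 0), matter. It maximizes the joint likelihood by profiling out each pair's intercept and compares the result with the conditional maximum likelihood estimate $\ln(n_{01}/n_{10})$. The function names, search intervals and iteration counts are illustrative assumptions.

```python
import math


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def best_zeta(beta, y1, y2):
    """Maximize a pair's log-likelihood over its intercept by bisection on the score,
    y1 + y2 - expit(zeta) - expit(zeta + beta), which is decreasing in zeta."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if y1 + y2 - expit(mid) - expit(mid + beta) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def profile_loglik(beta, n01, n10):
    """Joint log-likelihood of the discordant pairs with intercepts profiled out."""
    total = 0.0
    for (y1, y2), count in (((0, 1), n01), ((1, 0), n10)):
        z = best_zeta(beta, y1, y2)
        p1, p2 = expit(z), expit(z + beta)
        total += count * (math.log(p1 if y1 else 1.0 - p1)
                          + math.log(p2 if y2 else 1.0 - p2))
    return total


def jml_beta(n01, n10, lo=-10.0, hi=10.0):
    """Joint maximum likelihood estimate of beta by ternary search."""
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if profile_loglik(m1, n01, n10) < profile_loglik(m2, n01, n10):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def cml_beta(n01, n10):
    """Conditional maximum likelihood estimate for the same design."""
    return math.log(n01 / n10)
```

With, say, 30 and 10 discordant pairs, `jml_beta(30, 10)` comes out at twice `cml_beta(30, 10)`, in line with the limit $2\beta$ quoted above. Concordant pairs are left out because their estimated intercepts diverge and they contribute nothing to either likelihood.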

Numerical problems are common in the case of dichotomous responses. If all
responses for a cluster are zero (one), the joint maximum likelihood estimate
$\hat{\zeta}_j$ will diverge to $-\infty$ ($\infty$). Moreover, if an item $i$ is failed (passed) by all units
in the Rasch model, $\hat{\beta}_i$ tends to $-\infty$ ($\infty$). Note that the latter is a problem
in marginal maximum likelihood estimation as well. For the two-parameter
logistic IRT model it was observed by Wright (1977) that arbitrary upper
bounds on the discrimination parameters typically must be imposed in order
to prevent the estimates from diverging. Another important limitation of joint
maximum likelihood is that cluster-level covariates cannot be included.


6.10.2 Conditional maximum likelihood 


Rasch (1960) suggested using conditional maximum likelihood estimation in
the context of the one-parameter logistic IRT model or Rasch model,

$$\Pr(y_{ij} = 1) = \frac{\exp(\beta_i + \zeta_j)}{1 + \exp(\beta_i + \zeta_j)},$$

discussed in Section 3.3.4. Importantly, this approach circumvents the
incidental parameter problem.
The conditional maximum likelihood approach requires the existence of
sufficient statistics for the parameters $\zeta_j$. The likelihood is then maximized
conditional on these sufficient statistics.

Consider a logistic regression model where there are fixed cluster-specific
intercepts $\zeta_j$, so the linear predictor becomes $\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j$. For dichotomous
responses, $y_{ij} = 0, 1$, the model becomes

$$\Pr(y_{ij} = 1 \mid \mathbf{x}_{ij}) = \frac{\exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)}{1 + \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)}. \tag{6.25}$$

The Rasch model is the special case of the fixed effect logistic regression where
$\mathbf{x}_{ij}'\boldsymbol{\beta} = \beta_i$.
The responses within a cluster $j$ are independently distributed due to the
cluster-specific intercepts, with joint probability

$$\Pr(\mathbf{y}_j) = \prod_{i=1}^{n_j} \Pr(y_{ij} = 1)^{y_{ij}}\,[1 - \Pr(y_{ij} = 1)]^{1 - y_{ij}}.$$

Substituting (6.25) for $\Pr(y_{ij} = 1)$ and reexpressing, the joint probability can
alternatively be written as

$$\exp\Big[\zeta_j \sum_{i=1}^{n_j} y_{ij} + \boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij}\,y_{ij} - a(\zeta_j, \boldsymbol{\beta})\Big],$$

where

$$a(\zeta_j, \boldsymbol{\beta}) = \ln \prod_{i=1}^{n_j} \{1 + \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)\}.$$

It follows from the theory of exponential family distributions that $\sum_{i=1}^{n_j} y_{ij}$
is a minimal sufficient statistic for the cluster-specific parameter $\zeta_j$.
Consequently, the conditional distribution $\Pr(\mathbf{y}_j \mid \sum_{i=1}^{n_j} y_{ij})$ does not depend on $\zeta_j$.
The idea of conditional maximum likelihood is to estimate the ‘structural
parameters’ $\boldsymbol{\beta}$ by maximizing $\prod_j \Pr(\mathbf{y}_j \mid \sum_{i=1}^{n_j} y_{ij})$, which does not contain the
‘incidental parameters’ $\zeta_j$.

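The joint probability can also be expressed as 

$$\Pr(\mathbf{y}_j) = \frac{\exp[\zeta_j \tau_j + \boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} y_{ij}]}{\prod_{i=1}^{n_j} [1 + \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)]},$$

and the probability of $\tau_j = \sum_{i=1}^{n_j} y_{ij}$ as 

$$\Pr\Big(\sum_{i=1}^{n_j} y_{ij} = \tau_j\Big) = \frac{\sum_{\mathbf{d}_j \in B(\tau_j)} \exp[\zeta_j \tau_j + \boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} d_{ij}]}{\prod_{i=1}^{n_j} [1 + \exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)]}, \qquad \tau_j = 0, 1, \ldots, n_j,$$

where $B(\tau_j) = \{\mathbf{d}_j = (d_{1j}, \ldots, d_{n_j j}) : d_{ij} = 0 \text{ or } 1, \ \sum_{i=1}^{n_j} d_{ij} = \tau_j\}$. 
The number of elements in $B(\tau_j)$ is $\binom{n_j}{\tau_j}$, which increases 
rapidly with the cluster size. 
The conditional probability $\Pr(\mathbf{y}_j \mid \sum_{i=1}^{n_j} y_{ij} = \tau_j)$ 
can now be obtained by dividing $\Pr(\mathbf{y}_j)$ by 
$\Pr(\sum_{i=1}^{n_j} y_{ij} = \tau_j)$, 

$$\Pr\Big(\mathbf{y}_j \,\Big|\, \sum_{i=1}^{n_j} y_{ij} = \tau_j\Big) = \frac{\exp[\boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} y_{ij}]}{\sum_{\mathbf{d}_j \in B(\tau_j)} \exp[\boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} d_{ij}]}.$$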

It is worth noting that when $\tau_j = 1$ this probability has the same form as the 
multinomial logit model (see Section 2.4.3) and a partial likelihood 
contribution for Cox regression (see Section 2.5). 

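The conditional likelihood $l^c(\boldsymbol{\beta} \mid \mathbf{X})$ becomes 

$$l^c(\boldsymbol{\beta} \mid \mathbf{X}) = \prod_j \frac{\exp[\boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} y_{ij}]}{\sum_{\mathbf{d}_j \in B(\tau_j)} \exp[\boldsymbol{\beta}' \sum_{i=1}^{n_j} \mathbf{x}_{ij} d_{ij}]},$$

which does not depend on the incidental parameters $\zeta_j$. Note that clusters 
with $\tau_j = 0$ or $\tau_j = n_j$, having only zero or unit responses, do not 
contribute to the likelihood since their conditional probabilities become 1. 
Hence, there may be a considerable loss of data, especially when the clusters 
are small. 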
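
As a rough illustration of the conditional likelihood above, the following 
sketch evaluates its logarithm by brute-force enumeration of the set 
$B(\tau_j)$. The function names, the list-based data layout and the small 
example data are assumptions made for this illustration only; enumeration is 
practical only for small clusters because the number of terms grows as 
$\binom{n_j}{\tau_j}$. 

```python
import math
from itertools import combinations


def cluster_conditional_loglik(beta, X, y):
    """Log of Pr(y_j | sum_i y_ij = tau_j) for one cluster.

    beta: list of coefficients; X: list of covariate vectors, one per unit;
    y: list of 0/1 responses. No intercept zeta_j appears because it cancels.
    """
    n, tau = len(y), sum(y)
    scores = [sum(b * x for b, x in zip(beta, xi)) for xi in X]  # x_ij' beta
    numer = sum(s * yi for s, yi in zip(scores, y))
    # Enumerate B(tau_j): every way of placing tau ones among the n units.
    terms = [sum(scores[i] for i in ones)
             for ones in combinations(range(n), tau)]
    top = max(terms)  # log-sum-exp for numerical stability
    log_denom = top + math.log(sum(math.exp(t - top) for t in terms))
    return numer - log_denom


def conditional_loglik(beta, clusters):
    """Sum of cluster contributions; clusters is a list of (X, y) pairs."""
    return sum(cluster_conditional_loglik(beta, X, y) for X, y in clusters)


# Hypothetical data: one covariate, three clusters.
clusters = [
    ([[0.0], [1.0]], [0, 1]),            # discordant pair: informative
    ([[0.2], [0.7], [1.5]], [1, 0, 1]),  # tau = 2 of 3: informative
    ([[0.3], [0.9]], [1, 1]),            # all ones: contributes log(1) = 0
]
print(conditional_loglik([0.5], clusters))
```

Passing `conditional_loglik` to a general-purpose optimizer would give 
conditional maximum likelihood estimates; that step is omitted here. 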

It is instructive to investigate the conditional likelihood for the special case 
of clusters of size 2, i.e. $n_j = 2$. Here, the only situation contributing 
information is $\tau_j = 1$, i.e. $(y_{1j} = 0, y_{2j} = 1)$ and 
$(y_{1j} = 1, y_{2j} = 0)$. The conditional probability of the former becomes 

$$\Pr(y_{1j} = 0, y_{2j} = 1 \mid y_{1j} + y_{2j} = 1) = \frac{\Pr(y_{1j} = 0, y_{2j} = 1)}{\Pr(y_{1j} = 0, y_{2j} = 1) + \Pr(y_{1j} = 1, y_{2j} = 0)} = \frac{\exp[\boldsymbol{\beta}'(\mathbf{x}_{2j} - \mathbf{x}_{1j})]}{1 + \exp[\boldsymbol{\beta}'(\mathbf{x}_{2j} - \mathbf{x}_{1j})]}.$$

The resulting conditional likelihood thus reduces to an (unconditional) logistic 
likelihood for a dichotomous response equal to 1 if $(y_{1j} = 0, y_{2j} = 1)$ 
and equal to 0 if $(y_{1j} = 1, y_{2j} = 0)$ (discarding concordant responses), 
with covariates $\mathbf{x}_{2j} - \mathbf{x}_{1j}$. Importantly, it follows that 
elements of $\boldsymbol{\beta}$ pertaining to covariates that do not vary over 
units cannot be estimated by conditional maximum likelihood, since the 
corresponding elements of $\mathbf{x}_{2j} - \mathbf{x}_{1j}$ become zero for all 
$j$. In a (matched) case-control study the responses are 
$(y_{1j} = 0, y_{2j} = 1)$ if $y_{2j}$ and $y_{1j}$ represent the case and 
control, respectively. It follows from the above result that the conditional 
likelihood now reduces to an (unconditional) likelihood for a logistic model 
with constant response equal to 1 and covariates 
$\mathbf{x}_{2j} - \mathbf{x}_{1j}$. See also Holford et al. (1978). 
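
The size-2 result can be checked numerically. The sketch below, with a 
hypothetical helper name and made-up covariate values, computes the 
conditional probability of the pair $(0, 1)$ from the covariate difference and 
shows that a covariate which is constant within the cluster has no influence 
on it, whatever its coefficient. 

```python
import math


def pair_conditional_prob(beta, x1, x2):
    """Pr(y_1j = 0, y_2j = 1 | y_1j + y_2j = 1) for a cluster of size 2."""
    diff = sum(b * (b2 - b1) for b, b1, b2 in zip(beta, x1, x2))
    return 1.0 / (1.0 + math.exp(-diff))


# The second covariate is constant within the cluster, so the corresponding
# element of x_2j - x_1j is zero and its coefficient drops out.
x1, x2 = [0.0, 1.0], [1.0, 1.0]
p_a = pair_conditional_prob([0.5, 3.0], x1, x2)
p_b = pair_conditional_prob([0.5, -7.0], x1, x2)
print(p_a, p_b)
```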

Andersen (1970) demonstrates that conditional maximum likelihood yields 
consistent and asymptotically normal estimators under weak regularity 
conditions. For the exponential family he also shows that the estimates are 
asymptotically efficient under ‘S-ancillarity’ (e.g. Barndorff-Nielsen, 1978). 
Andersen (1973) demonstrates that this condition holds for the Rasch model. 

An advantage of conditional maximum likelihood is that we need not make 
distributional assumptions regarding the latent variables. Importantly, this 
implies that inference based on this approach is likely to be more ‘robust’ 
than using random effects. If the random effects model is true we would 
expect a loss of efficiency using the conditional maximum likelihood estimator, 
but Andersen (1973) demonstrates that the loss is small. Another important 
merit is that correlation between the latent variable and covariates is 
unproblematic, a feature that has attracted a lot of attention in econometrics 
(e.g. Hausman, 1978). In contrast, random effects models usually assume that 
the random effects and the covariates are uncorrelated. However, in random 
effects models this assumption can be relaxed for some covariates by including 
the cluster means of these covariates as additional predictors in the model; 
see page 52. 

Unfortunately, relying on conditional maximum likelihood severely limits 
the types of models that can be estimated. The sufficient statistic required to 
construct a conditional likelihood only exists for simple models with cluster- 
specific intercepts. Furthermore, it is required that the models belong to the 
exponential family, having canonical links. For continuous responses, this is the 
standard identity link, for dichotomous responses the logit link (in contrast to 
for instance the probit), for counts the log link and for unordered polytomous 
responses the multinomial logit (e.g. Andersen, 1973; Chamberlain, 1980). 
Even simple models with cluster-specific intercepts cannot be estimated if they 
include cluster-specific covariates. As discussed on pages 81 to 84 the estimated 
regression parameters for within-cluster covariates therefore reflect only the 
within-cluster effects. In contrast, random effects estimates are a weighted 
average of within-cluster and between-cluster effects (equal to the within- 
cluster effects in the case of balanced data). Another drawback compared to 
random effects models is that we cannot have a structural model where cluster- 
specific fixed effects are regressed on covariates or other cluster-specific effects. 


6.11 Bayesian methods 

6.11.1 Introduction 


In the Bayesian approach there is no distinction between latent variables and 
parameters; all are considered random quantities. Let D denote the observed 
data and L denote parameters as well as latent variables (including missing 
data). Inference requires setting up a joint probability distribution Pr(D,L) 
over all random quantities. The joint distribution comprises two parts: the 
likelihood Pr(D|L) and the prior distribution Pr(L). Specifying Pr(D|L) and 
Pr(L) gives a full probability model, where 

Pr(D,L) = Pr(D|L)Pr(L). 


Having observed D, Bayes’ theorem is used to obtain the posterior distribution 
of L given D, 

$$\Pr(L \mid D) = \frac{\Pr(D \mid L)\Pr(L)}{\int \Pr(D \mid L)\Pr(L)\, dL}.$$


Loosely speaking, the posterior distribution updates prior ‘knowledge’ 
(represented by the prior distribution) with information in the observed data 
(represented by the likelihood). Note that the posterior distribution is 
proportional to the product of the prior distribution and the likelihood. 

All the information regarding the unknown quantities is contained in their 
posterior distribution given the data D. However, it is difficult if not impossible 
to comprehend a posterior distribution with possibly thousands of dimensions 
and Bayesians thus typically summarize the information in the posterior. A 
preferred summary is the posterior expectation of the parameters, since these 
‘estimates’ minimize the posterior expectation of a quadratic loss function 
(mean squared error of estimates). The expectation also has the merit that its 
value for a subset of parameters is invariant with respect to marginalization 
over the remaining parameters. In contrast, the posterior modes are not 
invariant under marginalization. Other features of the posterior are also used 
for Bayesian inference, including moments, quantiles and highest posterior 
density regions. All these quantities can be expressed as posterior expectations 
of functions of L. 
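
A small numerical sketch may make these statements concrete. It assumes a 
deliberately simple one-parameter problem that is not taken from the text: a 
binomial probability with a flat prior, with the posterior evaluated on a grid 
so that the normalizing integral becomes a sum. All names and the data (7 
successes in 10 trials) are hypothetical. The sketch compares the posterior 
mean and mode and evaluates the posterior expected quadratic loss at each. 

```python
def grid_posterior(successes, trials, grid_size=2000):
    """Posterior of a binomial probability under a flat prior, on a grid."""
    grid = [(k + 0.5) / grid_size for k in range(grid_size)]
    # Likelihood times prior; the flat prior contributes a constant factor.
    unnorm = [p ** successes * (1 - p) ** (trials - successes) for p in grid]
    total = sum(unnorm)  # stands in for the normalizing integral
    return grid, [u / total for u in unnorm]


def posterior_expectation(f, grid, weights):
    """Posterior expectation of f(p), approximated by a weighted sum."""
    return sum(f(p) * w for p, w in zip(grid, weights))


grid, weights = grid_posterior(successes=7, trials=10)
post_mean = posterior_expectation(lambda p: p, grid, weights)
post_mode = grid[max(range(len(grid)), key=lambda k: weights[k])]


def expected_sq_loss(c):
    """Posterior expected quadratic loss of reporting the value c."""
    return posterior_expectation(lambda p: (p - c) ** 2, grid, weights)


print(post_mean, post_mode)
print(expected_sq_loss(post_mean), expected_sq_loss(post_mode))
```

In this example the expected quadratic loss is smaller at the posterior mean 
than at the posterior mode, as the argument above implies. 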

6.11.2 Bayes modal or modal a posteriori (MAP) 

Before the advent of Markov chain Monte Carlo (MCMC) methods (see below), 
Bayes modal or modal a posteriori (MAP) methods were often used to 
approximate expectations, since modes are often easier to approximate 
numerically. Lindley and Smith (1972) suggested that inference could be based 
on the joint posterior mode which can be obtained by using standard 
optimization methods. However, posterior expectations of subsets of parameters 
are generally better approximated by the mode after marginalization over the 
other parameters (e.g. O’Hagan, 1976). Inference regarding structural parameters 
$\vartheta$ would thus preferably proceed by considering the mode of marginal 
posteriors with incidental parameters and latent variables integrated out. 
Another problem with the use of joint posterior modes is that an incidental 
parameter problem may arise, invalidating the asymptotic normality of Bayes 
modal predictions. Thus, care must be exercised to ensure that Bayes modal 
predictions coincide with Bayes (mean) predictions in large sample situations. 
In the context of IRT models, the joint posterior mode has been used by 
Swaminathan and Gifford (1982, 1985, 1986) for the one, two, and 
three-parameter logistic models respectively. 


6.11.3 Hierarchical Bayesian models 

Latent variable models can be viewed as hierarchical Bayesian models. This is 
because the prior distributions of some of the parameters, namely the latent 
variables, depend on further parameters (the variances and covariances of 
the latent variables) known as hyperparameters. The distributions of these 
hyperparameters are known as hyperpriors. At stage one the distribution 
$\Pr(D \mid \boldsymbol{\zeta}, \vartheta_1)$ of the responses is specified 
conditional on the latent variables $\boldsymbol{\zeta}$ and parameters 
$\vartheta_1$. At stage two, the prior distribution $\Pr(\vartheta_1)$ of the 
parameters and the prior distribution $\Pr(\boldsymbol{\zeta} \mid \vartheta_2)$ 
of the latent variables are specified, the latter depending on hyperparameters 
$\vartheta_2$. At stage three, a hyperprior $\Pr(\vartheta_2)$ for the 
hyperparameters is specified. In generalized linear mixed models $\vartheta_1$ 
would be regression and possibly dispersion and threshold parameters, whereas 
$\vartheta_2$ would be variance and covariance parameters of the random 
effects. The posterior distribution of the parameters given the data can be 
written as 

$$\Pr(\boldsymbol{\zeta}, \vartheta_1, \vartheta_2 \mid D) \propto \Pr(D \mid \boldsymbol{\zeta}, \vartheta_1)\Pr(\vartheta_1)\Pr(\boldsymbol{\zeta} \mid \vartheta_2)\Pr(\vartheta_2). \qquad (6.26)$$

Thus, the term ‘hierarchical’ does not refer to the data structure but to this 
sequential model specification. Bayesian hierarchical linear models are 
presented by Lindley (1971), Lindley and Smith (1972) and Smith (1973) and 
applied in Novick et al. (1973). A ‘frequentist’ latent variable model can be 
viewed as ‘empirical Bayesian’ because the parameters (variances and 
covariances) of the ‘prior’ distribution for the latent variables are estimated 
instead of assuming a ‘hyperprior’ distribution. 

6.11.4 Prior distributions 

There appear to be four different motivations for using prior (and hyperprior) 
distributions, the first ‘truly’ Bayesian and the others ‘pragmatic’ Bayesian. 
True Bayesians would specify prior distributions reflecting prior beliefs or 
knowledge regarding the parameters. For instance, factor loadings in 
independent clusters factor models are expected to be positive and this prior 
belief could be represented by appropriate prior distributions. Another example 
would be a prior for a treatment effect in a clinical trial based on elicited 
expert opinion (e.g. Spiegelhalter et al., 1994). 

Second, a prior can be used to ensure that estimates are confined to the 
permitted parameter space, for instance to avoid ‘Heywood cases’ where unique 
factors have negative variances (e.g. Martin and McDonald, 1975). In latent 
class modeling priors are sometimes used to prevent boundary solutions where 
conditional response probabilities approach zero or one with corresponding 
logit parameters approaching $-\infty$ and $\infty$. 

Third, priors can aid identification. For instance, a measurement model 
without replicate measures could be identified by specifying a prior for the 
measurement error variance. This approach may be preferable to the 
conventional approach of treating the variance as a known parameter, since the 
prior can reflect parameter uncertainty. In the same vein, priors have been 
used as a cure for the problem of excessive standard errors in the 
three-parameter logistic IRT model (Wainer and Thissen, 1982). 

Finally, perhaps the most prevalent pragmatic reason for using priors is that 
Markov chain Monte Carlo (MCMC) methods (described below) can then be 
used for estimating complex models for which other methods perform badly or 
are unfeasible, for instance generalized linear mixed models with crossed 
random effects. The prior is typically specified as ‘noninformative’ (also 
denoted as flat, vague or diffuse) to minimize its effect on statistical 
inference. The likelihood component of the posterior then dominates so that the 
posterior becomes nearly proportional to the likelihood. This pragmatic 
approach is for instance reflected in modern books on Bayesian modeling such as 
Congdon (2001) and the examples accompanying the popular BUGS software 
(Spiegelhalter et al., 1996bc). 


6.11.5 Markov chain Monte Carlo 

The aim of MCMC is to draw parameters and latent variables $\varphi \equiv L$ 
from the posterior distribution $h(\varphi)$ to obtain the expectation of a 
function $f(\varphi)$. Drawing independent samples from $h(\varphi)$ as in crude 
Monte Carlo integration described in Section 6.3.3 may not be feasible. 
However, consistent estimators of the expectation $E[f(\varphi)]$ can be 
obtained from dependent samples as long as the samples are drawn throughout the 
support of $h(\varphi)$ in correct proportions. This can be accomplished by 
using a Markov chain with the target distribution $h(\varphi)$ as its 
stationary distribution, leading to Markov chain Monte Carlo (MCMC). 

Let $\{\varphi^{(0)}, \varphi^{(1)}, \ldots\}$ be a sequence of random variables. 
In a first order homogeneous Markov chain the next state $\varphi^{(r+1)}$ is 
sampled from a distribution $P(\varphi^{(r+1)} \mid \varphi^{(r)})$, which only 
depends on the current state $\varphi^{(r)}$ and neither on the ‘history’ of the 
chain $\{\varphi^{(0)}, \varphi^{(1)}, \ldots, \varphi^{(r-1)}\}$ nor $r$. 
Importantly, the chain will gradually ‘forget’ its initial state and eventually 
converge to a unique stationary distribution. To obtain the required 
distribution we discard the states up to the ‘time’ when we believe that 
stationarity has been reached, known as the ‘burn-in’ period. Sometimes several 
chains with different initial states are used to monitor convergence to a 
stationary distribution. After the burn-in, we need to run the chain 
sufficiently long for the sample averages of $f(\varphi)$ to reliably estimate 
the required expectations; determining how long is not trivial since the draws 
are dependent. 
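
The ‘forgetting’ of the initial state can be seen in a toy chain. The sketch 
below propagates the state distribution of a hypothetical two-state Markov 
chain (the transition matrix is an arbitrary choice for illustration, not from 
the text) from two different starting states and shows that both runs end up 
at the same stationary distribution. 

```python
def step(dist, P):
    """One transition: new_dist[k] = sum over s of dist[s] * P[s][k]."""
    n = len(P)
    return [sum(dist[s] * P[s][k] for s in range(n)) for k in range(n)]


def run(dist, P, n_steps):
    """Distribution of the state after n_steps transitions."""
    for _ in range(n_steps):
        dist = step(dist, P)
    return dist


# Hypothetical transition matrix; row s holds Pr(next state | current state s).
P = [[0.9, 0.1],
     [0.5, 0.5]]
from_state_0 = run([1.0, 0.0], P, 50)
from_state_1 = run([0.0, 1.0], P, 50)
print(from_state_0, from_state_1)
```

For this matrix the stationary distribution solves $\pi P = \pi$, which gives 
$\pi = (5/6, 1/6)$; both runs approach it regardless of the starting state. 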

There are several ways of constructing a Markov chain with the target 
distribution $h(\varphi)$ as stationary distribution. We start with the most 
complex algorithm and gradually proceed to the simpler. 

The Metropolis-Hastings algorithm 

From $\varphi^{(r)}$, the next state $\varphi^{(r+1)}$ is obtained as follows: 

Step 1: Sample a candidate point $\varphi^{c}$ from some proposal distribution 
$q(\varphi^{c} \mid \varphi^{(r)})$. For instance, 
$q(\varphi^{c} \mid \varphi^{(r)})$ could be multivariate normal with mean 
$\varphi^{(r)}$ and fixed covariance matrix. 

Step 2: Accept the candidate point with probability 

$$\alpha(\varphi^{(r)}, \varphi^{c}) = \min\left\{1, \ \frac{h(\varphi^{c})\, q(\varphi^{(r)} \mid \varphi^{c})}{h(\varphi^{(r)})\, q(\varphi^{c} \mid \varphi^{(r)})}\right\}.$$

Roughly speaking, the algorithm proceeds in the following way: In Step 1 we 
sample from a convenient but incorrect distribution. In Step 2 we correct for 
this in a correct but rather nonintuitive way. 

If the candidate point is accepted, the next state becomes 
$\varphi^{(r+1)} = \varphi^{c}$; if the candidate is rejected, the chain does 
not move and $\varphi^{(r+1)} = \varphi^{(r)}$. Use of the Metropolis-Hastings 
algorithm thus requires the ability to draw from the proposal distribution and 
calculate the fraction involved in obtaining the acceptance probability. 
Importantly, the target distribution need not be normalized so that the 
denominator of posterior distributions may be omitted. 

The stationary distribution will be $h(\varphi)$ whatever proposal distribution 
$q(\varphi^{c} \mid \varphi^{(r)})$ is used. However, the rate of convergence 
will of course depend on how close the proposal distribution is to the target 
distribution. 
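
The two steps can be sketched in a few lines. The example below is only an 
illustration under stated assumptions: the target is an unnormalized 
Gamma(3, 1) density on the positive reals, and the proposal is a 
multiplicative log-normal random walk with an arbitrary scale `SIGMA`. Neither 
choice comes from the text. The proposal is asymmetric, so the ratio of 
proposal densities in the acceptance probability does not cancel. 

```python
import math
import random


def metropolis_hastings(log_h, propose, log_q, start, n_iter, rng):
    """Generic Metropolis-Hastings chain for a scalar state.

    log_h(x): log of the (possibly unnormalized) target density.
    propose(cur, rng): draws a candidate given the current state.
    log_q(to, frm): log proposal density of moving from `frm` to `to`.
    """
    chain, cur = [start], start
    for _ in range(n_iter):
        cand = propose(cur, rng)
        log_ratio = ((log_h(cand) + log_q(cur, cand))
                     - (log_h(cur) + log_q(cand, cur)))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            cur = cand      # accept the candidate
        chain.append(cur)   # on rejection the chain repeats its state
    return chain


SIGMA = 0.8  # hypothetical proposal scale


def log_h(x):
    """Unnormalized log density of a Gamma(shape 3, rate 1) target."""
    return 2.0 * math.log(x) - x


def propose(cur, rng):
    """Multiplicative (log-normal) random walk, so candidates stay positive."""
    return cur * math.exp(SIGMA * rng.gauss(0.0, 1.0))


def log_q(to, frm):
    """Log-normal proposal density, up to a constant that cancels."""
    return -math.log(to) - (math.log(to) - math.log(frm)) ** 2 / (2 * SIGMA ** 2)


rng = random.Random(1)
chain = metropolis_hastings(log_h, propose, log_q, start=1.0,
                            n_iter=20000, rng=rng)
draws = chain[2000:]              # discard a burn-in period
print(sum(draws) / len(draws))    # estimate of the target mean (3 for this target)
```

Because only the ratio of target densities enters, `log_h` omits the 
normalizing constant of the gamma density. 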

The Metropolis algorithm 

This algorithm is the special case of the Metropolis-Hastings algorithm where 
only symmetric proposal distributions where 
$q(\varphi^{c} \mid \varphi^{(r)}) = q(\varphi^{(r)} \mid \varphi^{c})$, such as 
the multivariate normal or t distributions, are considered. In this case the 
acceptance probability simplifies to 

$$\alpha(\varphi^{(r)}, \varphi^{c}) = \min\left\{1, \ \frac{h(\varphi^{c})}{h(\varphi^{(r)})}\right\},$$

which does not depend on the proposal distribution. The random walk 
Metropolis algorithm arises as the special case where 
$q(\varphi^{c} \mid \varphi^{(r)}) = q(|\varphi^{c} - \varphi^{(r)}|)$. 


The single components Metropolis-Hastings algorithm 


Consider now the partitioning of the vector $\varphi$ into $k$ components, which 
can be blocks and not necessarily scalars, 
$\varphi = \{\varphi_1, \varphi_2, \ldots, \varphi_k\}$. Instead of updating the 
entire vector $\varphi$, single-component methods update the $k$ components 
$\varphi_i$ one at a time. Let $\varphi_i^{(r)}$ denote the state of component 
$\varphi_i$ at the end of iteration $r$. Define 
$\varphi_{-i}^{(r)} = \{\varphi_1^{(r+1)}, \ldots, \varphi_{i-1}^{(r+1)}, \varphi_{i+1}^{(r)}, \ldots, \varphi_k^{(r)}\}$ 
as $\varphi$ without its $i$th element, after completing step $i - 1$ of 
iteration $r + 1$. We have presumed a fixed updating order, although different 
types of random order are possible. It may also be beneficial to update highly 
dependent components more frequently than others (e.g. Zeger and Karim, 1991). 

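In the single components Metropolis-Hastings algorithm the candidate 
$\varphi_i^{c}$ is drawn from the proposal distribution 
$q(\varphi_i^{c} \mid \varphi_i^{(r)}, \varphi_{-i}^{(r)})$. The candidate is 
accepted with probability 

$$\alpha(\varphi_{-i}^{(r)}, \varphi_i^{(r)}, \varphi_i^{c}) = \min\left\{1, \ \frac{h(\varphi_i^{c} \mid \varphi_{-i}^{(r)})\, q(\varphi_i^{(r)} \mid \varphi_i^{c}, \varphi_{-i}^{(r)})}{h(\varphi_i^{(r)} \mid \varphi_{-i}^{(r)})\, q(\varphi_i^{c} \mid \varphi_i^{(r)}, \varphi_{-i}^{(r)})}\right\}.$$

Here, $h(\varphi_i \mid \varphi_{-i})$ is the full conditional distribution for 
$\varphi_i$ under $h(\varphi)$, the distribution of the $i$th component of 
$\varphi$ conditioning on all the remaining components. Importantly, 
$h(\varphi_i \mid \varphi_{-i})$ simplifies when $h(\varphi)$ derives from a 
conditional independence model. 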
The Gibbs sampler 

The basic idea of the Gibbs sampler is to utilize the fact that conditional 
distributions of (blocks of) random variables may be relatively simple in 
spite of a complicated joint distribution. It is a special case of the single 
components Metropolis-Hastings algorithm where the proposal distribution 
$q(\varphi_i^{c} \mid \varphi_i^{(r)}, \varphi_{-i}^{(r)})$ for updating the 
$i$th component of $\varphi$ is the full conditional distribution 
$h(\varphi_i^{c} \mid \varphi_{-i}^{(r)})$. Note that substituting 
$h(\varphi_i^{c} \mid \varphi_{-i}^{(r)})$ for 
$q(\varphi_i^{c} \mid \varphi_i^{(r)}, \varphi_{-i}^{(r)})$ in the acceptance 
probability of the single components Metropolis-Hastings algorithm produces an 
acceptance probability of 1, 
so candidates are always accepted in Gibbs sampling. Hence, the target 
distribution is simulated by performing a random walk on the vector $\varphi$, 
altering one of its components at a time. The optimal scenario for the Gibbs 
sampler is where the components of $\varphi$ are independent in the target 
distribution, in which case each iteration produces a new independent draw of 
$\varphi$. If the components are highly correlated, convergence can be improved 
by orthogonalizing the components. 
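
A minimal sketch of the idea, assuming a target that is not discussed in the 
text: a standard bivariate normal distribution with correlation `rho`, for 
which each full conditional is the univariate normal 
$N(\rho x, 1 - \rho^2)$. The function names, the starting values and the 
choice `rho=0.6` are illustrative assumptions. 

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, rng, start=(5.0, -5.0)):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x1, x2 = start
    sd = math.sqrt(1.0 - rho ** 2)    # conditional standard deviation
    draws = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)  # draw from h(x1 | x2)
        x2 = rng.gauss(rho * x1, sd)  # draw from h(x2 | x1), using the new x1
        draws.append((x1, x2))
    return draws


def correlation(pairs):
    """Sample correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    s12 = sum((a - m1) * (b - m2) for a, b in pairs)
    s11 = sum((a - m1) ** 2 for a, _ in pairs)
    s22 = sum((b - m2) ** 2 for _, b in pairs)
    return s12 / math.sqrt(s11 * s22)


rng = random.Random(2)
draws = gibbs_bivariate_normal(rho=0.6, n_iter=20000, rng=rng)[1000:]
print(correlation(draws))
```

With `rho = 0` each conditional ignores the other component, so successive 
draws are independent, which is the optimal scenario described above; as 
`rho` approaches 1 the conditional standard deviation shrinks and the chain 
moves in ever smaller steps. 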

Straightforward use of the Gibbs sampler requires that samples can be 
drawn from the full conditional distributions derived from the target 
distribution. When this is impossible, the more complicated Metropolis-Hastings 
algorithm may be used. Alternatively, Gilks and Wild (1992) suggested using 
adaptive rejection sampling for the common case of univariate and log-concave 
full conditional distributions (see also Dellaportas and Smith, 1993). This 
approach is implemented in the BUGS software (Spiegelhalter et al., 1996a). For 
continuous conditional distributions that are difficult to simulate from, Albert 
and Chib (1993) suggest simulating from discretized versions. 

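Example: Gibbs sampling for random intercept probit model 
Consider the random intercept probit model 

$$\Pr(y_{ij} = 1 \mid \mathbf{x}_{ij}, \zeta_j) = \Phi(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j),$$

where $\zeta_j \sim N(0, \psi)$. It is useful to express this model as a latent 
response model 

$$y^*_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \epsilon_{ij},$$

where $\epsilon_{ij} \sim N(0, 1)$, independent of $\zeta_j$. The latent 
responses $y^*_{ij}$ are related to observed responses $y_{ij}$ via the 
threshold function 

$$y_{ij} = \begin{cases} 1 & \text{if } y^*_{ij} > 0 \\ 0 & \text{if } y^*_{ij} \le 0. \end{cases}$$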

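For the regression coefficients $\boldsymbol{\beta}$ we consider the conjugate 
multivariate normal prior 

$$\boldsymbol{\beta} \sim N_p(\boldsymbol{\beta}_0, \boldsymbol{\Sigma}_{\beta}),$$

with $\boldsymbol{\beta}_0$ and $\boldsymbol{\Sigma}_{\beta}$ presumed known. A 
conjugate inverse-gamma (IG) density is specified for $\psi$, 

$$\Pr(\psi) = \frac{b^{a}}{\Gamma(a)}\, \psi^{-(a+1)} \exp\left(-\frac{b}{\psi}\right), \qquad \psi > 0,$$

where $a > 0$ and $b > 0$ are known shape and scale parameters, respectively, 
and $\Gamma(\cdot)$ is the gamma function. It is assumed that $\psi$ and 
$\boldsymbol{\beta}$ are independent. 
The posterior distribution 
$\Pr(\mathbf{y}^*, \boldsymbol{\zeta}, \psi, \boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X})$ 
is the distribution of the unknown parameters ($\boldsymbol{\beta}$ and $\psi$) 
and latent variables $\boldsymbol{\zeta}$ and latent responses $\mathbf{y}^*$, 
given the data $\mathbf{X}$ and $\mathbf{y}$. The posterior can be expressed as 

$$\Pr(\mathbf{y}^*, \boldsymbol{\zeta}, \psi, \boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto [\Pr(\mathbf{y} \mid \mathbf{y}^*)\Pr(\mathbf{y}^* \mid \boldsymbol{\zeta}, \boldsymbol{\beta}, \mathbf{X})\Pr(\boldsymbol{\zeta} \mid \psi)]\, \Pr(\psi)\Pr(\boldsymbol{\beta}). \qquad (6.27)$$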
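Here, the normalizing constant is the marginal distribution of the observed 
responses $\mathbf{y}$. The joint density of $\mathbf{y}$, $\mathbf{y}^*$, and 
$\boldsymbol{\zeta}$ (given $\boldsymbol{\beta}$ and $\psi$) in square 
brackets, referred to as the ‘augmented complete data likelihood’, takes the 
form 

$$\Pr(\mathbf{y}, \mathbf{y}^*, \boldsymbol{\zeta} \mid \boldsymbol{\beta}, \psi, \mathbf{X}) = \prod_{j=1}^{J} \left\{ \prod_{i=1}^{n_j} \left[ I(y^*_{ij} > 0) I(y_{ij} = 1) + I(y^*_{ij} \le 0) I(y_{ij} = 0) \right] \phi(y^*_{ij}; \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j, 1) \right\} \phi(\zeta_j; 0, \psi),$$

where $I(\cdot)$ is the indicator function. Substituting this expression into the 
posterior in (6.27), it is evident that the normalizing constant does not have 
a closed form, making it difficult to simulate directly from the posterior 
distribution. 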

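Fortunately, the Gibbs sampler can instead be applied since the full 
conditional distributions for the latent responses, parameters and latent 
variables are simple: 

1. Independent truncated normal full conditionals of $y^*_{ij}$, 
$j = 1, \ldots, J$, $i = 1, \ldots, n_j$: 

$$\Pr(y^*_{ij} \mid \mathbf{X}, \mathbf{y}, \boldsymbol{\zeta}, \psi, \boldsymbol{\beta}) = \begin{cases} \phi_{-}(y^*_{ij}; \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j, 1) & \text{if } y_{ij} = 1 \\ \phi_{+}(y^*_{ij}; \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j, 1) & \text{if } y_{ij} = 0, \end{cases}$$

where $\phi_{-}(\cdot)$ is a left-truncated normal density equal to 0 when 
$y^*_{ij} \le 0$ and $\phi_{+}(\cdot)$ is a right-truncated normal density equal 
to 0 when $y^*_{ij} > 0$. To simulate from the truncated normals, we can 
simulate from a normal density and discard the draws falling outside the 
permitted interval. To avoid this ‘waste’ of simulations, we can alternatively 
first generate a random uniform (0,1) variate $u_{ij}$. Then if $y_{ij} = 1$ we 
calculate 
$\epsilon_{ij} = \Phi^{-1}[1 - \Phi(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)[1 - u_{ij}]]$ 
and if $y_{ij} = 0$ we calculate 
$\epsilon_{ij} = \Phi^{-1}[[1 - \Phi(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j)] u_{ij}]$, 
in either case setting 
$y^*_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \epsilon_{ij}$. 

Importantly, having simulated $y^*_{ij}$, the full conditionals of the other 
random variables become independent of $y_{ij}$. 

2. Multinormal full conditional 
$\Pr(\boldsymbol{\beta} \mid \psi, \boldsymbol{\zeta}, \mathbf{y}^*, \mathbf{X}, \mathbf{y})$ 
of $\boldsymbol{\beta}$ with mean 
$(\boldsymbol{\Sigma}_{\beta}^{-1} + \sum_{j=1}^{J} \mathbf{X}_j'\mathbf{X}_j)^{-1} [\boldsymbol{\Sigma}_{\beta}^{-1}\boldsymbol{\beta}_0 + \sum_{j=1}^{J} \mathbf{X}_j'(\mathbf{y}^*_j - \zeta_j \mathbf{1}_{n_j})]$ 
and covariance matrix 
$(\boldsymbol{\Sigma}_{\beta}^{-1} + \sum_{j=1}^{J} \mathbf{X}_j'\mathbf{X}_j)^{-1}$. 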
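3. Independent normal full conditionals of $\zeta_j$, $j = 1, \ldots, J$: 

$$\Pr(\zeta_j \mid \boldsymbol{\beta}, \psi, \mathbf{y}^*, \mathbf{X}, \mathbf{y}) = \phi\Big(\zeta_j;\ (\psi^{-1} + n_j)^{-1} \sum_{i=1}^{n_j} (y^*_{ij} - \mathbf{x}_{ij}'\boldsymbol{\beta}),\ (\psi^{-1} + n_j)^{-1}\Big).$$

The mean of the conditional is a special case of the so-called empirical 
Bayes predictor derived in (7.5). Note that the full conditional would 
not be multinormal if we had not augmented the data with $y^*_{ij}$. 

4. Inverse-gamma full conditional of $\psi$: 

$$\Pr(\psi \mid \boldsymbol{\beta}, \boldsymbol{\zeta}, \mathbf{y}^*, \mathbf{X}, \mathbf{y}) = \frac{[b + \frac{1}{2}\sum_{j=1}^{J} \zeta_j^2]^{a + J/2}}{\Gamma(a + J/2)}\, \psi^{-(a + J/2 + 1)} \exp\Big[-\psi^{-1}\Big(b + \frac{1}{2}\sum_{j=1}^{J} \zeta_j^2\Big)\Big].$$

Gibbs sampling simply proceeds by sampling from the full conditional 
distributions above from some starting values and iterating to a stationary 
distribution. 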

Note that the straightforward application of the Gibbs sampler in our 
example rests on a judiciously chosen set-up, involving the probit 
specification and data augmentation with latent responses (e.g. Tanner and 
Wong, 1987). More complex procedures must be invoked in other cases, for 
instance for the random intercept logit model where Zeger and Karim (1991) 
suggested using rejection sampling, Spiegelhalter et al. (1996a) adaptive 
rejection sampling (Gilks and Wild, 1992) and Browne and Draper (2004) 
a hybrid Metropolis-Gibbs approach. 
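
The four full conditionals of the example can be turned into a short sampler. 
The sketch below is a simplified illustration, not the authors' 
implementation: it assumes a single covariate with a scalar coefficient (so 
the multinormal step reduces to a univariate normal), hypothetical prior 
settings (`beta0`, `s2_beta`, `a`, `b`), simulated data, and a crude clamp 
that keeps the inverse normal distribution function away from 0 and 1. 

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_latent(mu, y, rng):
    """Draw y* ~ N(mu, 1) truncated to y* > 0 if y == 1, else to y* <= 0."""
    u = rng.random()
    phi_mu = STD_NORMAL.cdf(mu)
    p = 1.0 - phi_mu * (1.0 - u) if y == 1 else (1.0 - phi_mu) * u
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < p < 1
    return mu + STD_NORMAL.inv_cdf(p)


def gibbs_probit(x, y, n_iter, rng, beta0=0.0, s2_beta=100.0, a=1.0, b=1.0):
    """x, y: lists of per-cluster lists. Returns a list of (beta, psi) draws."""
    J = len(y)
    beta, psi, zeta = 0.0, 1.0, [0.0] * J
    sum_x2 = sum(xij ** 2 for xj in x for xij in xj)
    draws = []
    for _ in range(n_iter):
        # 1. latent responses from truncated normals
        ystar = [[draw_latent(beta * xij + zeta[j], yij, rng)
                  for xij, yij in zip(x[j], y[j])] for j in range(J)]
        # 2. regression coefficient (scalar version of the multinormal step)
        var_b = 1.0 / (1.0 / s2_beta + sum_x2)
        mean_b = var_b * (beta0 / s2_beta + sum(
            xij * (ys - zeta[j])
            for j in range(J) for xij, ys in zip(x[j], ystar[j])))
        beta = rng.gauss(mean_b, math.sqrt(var_b))
        # 3. random intercepts
        for j in range(J):
            var_z = 1.0 / (1.0 / psi + len(y[j]))
            resid = sum(ys - beta * xij for xij, ys in zip(x[j], ystar[j]))
            zeta[j] = rng.gauss(var_z * resid, math.sqrt(var_z))
        # 4. variance: an inverse-gamma draw is the reciprocal of a gamma draw
        shape = a + J / 2.0
        rate = b + 0.5 * sum(z * z for z in zeta)
        psi = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        draws.append((beta, psi))
    return draws


# Simulated data: 100 clusters of size 4, true beta = 1 and psi = 1.
rng = random.Random(3)
x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(100)]
true_zeta = [rng.gauss(0, 1) for _ in range(100)]
y = [[1 if 1.0 * xij + zj + rng.gauss(0, 1) > 0 else 0 for xij in xj]
     for xj, zj in zip(x, true_zeta)]
draws = gibbs_probit(x, y, n_iter=600, rng=rng)[200:]  # drop burn-in
beta_mean = sum(bd for bd, _ in draws) / len(draws)
print(beta_mean)
```

The run length and burn-in used here are arbitrary and far shorter than would 
be appropriate for real inference. 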


Alternating imputation posterior algorithm 

A special kind of MCMC algorithm has been suggested by Clayton and Rasbash 
(1999) for generalized linear models with crossed random effects. Their 
algorithm is based on the imputation posterior (IP) algorithm of Tanner and 
Wong (1987) which iterates between an ‘imputation’ (I) step and a ‘posterior’ 
(P) step. The algorithm is similar to Gibbs sampling except that, in the 
P-step, the whole parameter vector is sampled from its conditional posterior 
distribution given the latent variables, instead of single components. 

The algorithm can be outlined as follows: 

• I-step: Draw a sample $\boldsymbol{\zeta}^{r}$ from the posterior distribution 
of $\boldsymbol{\zeta}$ given $\mathbf{y}$, $\vartheta^{r-1}$ and $\mathbf{X}$ 
(data augmentation). 

• P-step: Draw a sample $\vartheta^{r}$ from the posterior distribution of 
$\vartheta$ given $\mathbf{y}$, $\boldsymbol{\zeta}^{r}$ and $\mathbf{X}$. 

As in Gibbs sampling, the algorithm is run until the stationary distribution 
has been reached (for a burn-in period) and the parameters are estimated by 
their mean 

$$E(\vartheta \mid \mathbf{y}, \mathbf{X}) \approx \frac{1}{R} \sum_{r=1}^{R} \vartheta^{r}.$$

In the I-step, the usual empirical Bayesian posterior distribution of the 
latent variables for fixed parameters is used (see Section 7.2) but with 
parameters set equal to $\vartheta^{r-1}$ instead of the maximum likelihood 
estimates. In the P-step, the random effects drawn in the previous iteration 
are treated as fixed offsets. The posterior distribution of the parameters is 
then approximated by a multivariate normal distribution with mean given by the 
maximum likelihood estimates $\hat{\vartheta}^{r}$ (treating 
$\boldsymbol{\zeta}^{r}$ as offsets) and covariance matrix 
$\hat{\boldsymbol{\Sigma}}^{r}$ derived from the Hessian. This approximate 
‘sampling distribution’ approximates the true Bayesian posterior if uniform 
priors are assumed for all parameters. The variance of the parameter estimates 
is then estimated by (using Rao-Blackwellization) 

$$\mathrm{Var}(\vartheta \mid \mathbf{y}, \mathbf{X}) \approx \frac{1}{R} \sum_{r=1}^{R} \hat{\boldsymbol{\Sigma}}^{r} + \frac{1}{R-1} \sum_{r=1}^{R} (\hat{\vartheta}^{r} - \bar{\vartheta})(\hat{\vartheta}^{r} - \bar{\vartheta})',$$

the sum of within and between-imputation variances, where $\bar{\vartheta}$ 
denotes the mean of the $\hat{\vartheta}^{r}$. Clayton and Rasbash 
(1999) point out that this Rao-Blackwellization cannot be used in Gibbs 
sampling where individual parameters are sampled from their conditionals (not 
the entire parameter vector $\vartheta$ as here), since the conditional 
covariances (the off-diagonal elements of $\hat{\boldsymbol{\Sigma}}^{r}$) are 
in this case not available. 

Note that Clayton and Rasbash (1999) argue that ideally, the P-step should 
consist of two parts: (1) use REML to obtain an approximate posterior for 
the variance and covariance parameters and draw samples from this posterior 
and (2) approximate the posterior of the fixed parameters by a multivariate 
normal distribution with mean and variance from the ML solution setting the 
variance parameters equal to the draws from (1). 

Clayton and Rasbash (1999) use this algorithm to estimate models with 
crossed random effects. In their application women were artificially 
inseminated on several occasions using sperm from different donors and the 
response was success or failure for each attempt. Since sperm from each donor 
was also used to inseminate different women, the woman-specific random effects 
are crossed with donor-specific random effects. In their Alternating 
Imputation-Posterior (AIP) algorithm, Clayton and Rasbash therefore alternated 
between donor and woman ‘wings’ of the IP algorithm. In the donor wing, one 
iteration of IP is carried out, treating the woman-specific effects as offsets 
and drawing a new sample of donor-specific random effects. In the woman wing, 
the donor-specific random effects are treated as offsets in one iteration of IP 
to update the woman-specific random effects. The wings are alternated until 
convergence. 

Obtaining separate estimates of the model parameters from the two wings 
allows convergence of the algorithm to be assessed. The final estimates are 
averages over both wings. Ecochard and Clayton (2001) suggest running both 
wings in parallel. In Section 11.4 we use a modified version of the AIP 
algorithm for disease mapping with spatially correlated random effects where 
quadrature methods cannot be used. 


6.11.6 Advantages and disadvantages of MCMC 

MCMC methods have been used for a variety of latent variable models, 
including generalized linear mixed models (e.g. Zeger and Karim, 1991; Clayton, 
1996b), multilevel models (e.g. Browne, 1998), covariate measurement error 
models (e.g. Richardson and Gilks, 1993), disease mapping (e.g. Mollie, 1996), 
multilevel factor models (e.g. Goldstein and Browne, 2002) and multilevel item 
response models (e.g. Ansari and Jedidi, 2000; Fox and Glas, 2001). Numerous 
applications can be found in Congdon (2001) and Spiegelhalter et al. (1996bc). 
We use MCMC for a meta-analysis in Section 9.5 and for disease mapping in 
Section 11.4. 

An important merit of MCMC is that the approach can be used to estimate 
complex models for which other methods are either unfeasible or work poorly. 
Another advantage is that any characteristics of the posterior distribution can 
be investigated based on stationary simulated values, for instance posterior 
means and percentiles of ranks of random effects in institutional comparisons 
(e.g. Goldstein and Spiegelhalter, 1996). 

There are a number of more or less controversial issues that must be settled 
in using MCMC methods. First, the burn-in, the number of initial iterates 
to discard because of dependence on the starting values, must somehow be 
determined. Unfortunately, it is considerably more difficult to monitor 
convergence to a distribution than to a point (Gelman and Rubin, 1996). A 
popular approach is to use an arbitrarily large number, or to run a number of 
chains with different initial states to assess convergence. However, 
recommendations regarding the number of chains that one should rely on have 
been conflicting, including several long chains, one very long chain and 
several short chains (the latter seems to be misguided). 
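
One widely used way to operationalize the comparison of several chains is a 
potential scale reduction factor in the style of Gelman and Rubin. This 
diagnostic is not described in the text and is brought in here purely as an 
illustrative assumption; the function name and the simulated ‘chains’ (which 
are independent normal draws rather than real MCMC output) are hypothetical. 
Values close to 1 suggest the chains agree, whereas values well above 1 
indicate that they have not yet reached a common distribution. 

```python
import random


def potential_scale_reduction(chains):
    """Gelman-Rubin style R-hat for equal-length chains of a scalar quantity."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    W = sum(sum((v - mu) ** 2 for v in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n  # pooled estimate of the target variance
    return (var_plus / W) ** 0.5


rng = random.Random(4)
# Four stand-in chains that sample the same distribution ...
agree = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# ... and four where two chains sit in a different region.
stuck = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 5, 5)]
print(potential_scale_reduction(agree), potential_scale_reduction(stuck))
```

Such a diagnostic can flag a lack of convergence but cannot prove 
convergence, which is consistent with the caution expressed above. 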

Another problem is deciding when to stop the chain to ensure acceptable 
precision of the estimates. It can be particularly hard to judge convergence of 
the estimates when there is slow mixing, that is, when the chain moves slowly 
“through bottlenecks of the target distribution” (Gelman and Rubin, 1996). 
When mixing is poor, the chain has to be run for a very long time to obtain 
accurate estimates. 

Although the Bayesian approach can be useful for identification, 
implementation via MCMC makes it hard to discover lack of identification (e.g. 
Keane, 1992). One reason is that a flat posterior would not be detected as a 
natural byproduct of estimation, in contrast to maximization using gradient 
methods. Inadequate mixing of the chain could moreover falsely indicate that an 
unidentified parameter has been estimated with reasonable precision. 

Another problem concerns the specification of noninformative priors for 
variance parameters in random effects models (e.g. Natarajan and Kass, 2000; 
Hobert, 2000). For instance, Hobert (2000) points out that the prior used by 
Zeger and Karim (1991) in Gibbs sampling can lead to an improper posterior 
distribution (not integrating to one) although all full conditionals are proper. 
Unfortunately, this problem may not be apparent from the Gibbs output. 
Moreover, the prior is not noninformative in any formal sense even if a proper 
posterior is obtained, meaning that the prior is actually driving the inferences. 
This problem is shared by ‘diffuse’ proper priors which are typically proper 
conjugate priors that are nearly improper. Furthermore, these priors may lead 
to Gibbs samplers which converge very slowly. Hobert (2000) concludes that 
choosing a prior for a variance parameter is currently a real dilemma for a 
Bayesian with no prior information. Hobert also raises concerns about the 
theoretical properties of estimated standard errors. 


6.12 Summary 

The advantages of full maximum likelihood through explicit integration are (1) 
consistency if data are missing at random (MAR) and (2) the availability of 
a likelihood for likelihood based inference. Accuracy can be improved and 
assessed by increasing the number of quadrature points or Monte Carlo 
replications. Furthermore, the methods are applicable for the general model 
framework. The drawback is computational inefficiency. The Gauss-Hermite 
methods in particular become computationally demanding as the number of latent 
variables increases. The 6th order Laplace approximation by Raudenbush et 
al. (2000) appears to be very efficient and may be sufficiently accurate in many 
situations. 

Muthén’s limited information approach is an excellent alternative for a 
general class of models with multinormal latent responses. However, cluster 
sizes must be (nearly) constant with either few missing data or few missing 
data patterns. The estimation method is computationally extremely efficient and 
appears to produce estimates that are very close to maximum likelihood, 
except for complex models with small samples. Surprisingly, this approach has 
received scant attention in the multilevel modeling literature. A related 
limited information method is the ‘pseudo-likelihood’ approach; see e.g. le 
Cessie and van Houwelingen (1994) and Geys et al. (2002). 

MQL and PQL are also very computationally efficient whatever the number 
of latent variables. Unfortunately, these methods sometimes produce severely 
biased estimates. Unlike methods based on integration, the accuracy cannot 
be improved gradually, making it difficult to assess the reliability. This can be 
rectified by using parametric bootstrapping for bias correction as suggested by 
Kuk (1995) (see also Goldstein, 2003). MQL and PQL are currently confined 
to generalized linear mixed models. 

MCMC methods allow estimation of a very wide range of models and have 
become increasingly popular. However, this flexibility can lead to specification 
of overly complicated models that may not be identified and where it is 
difficult to assess the impact of the prior distributions. These problems are often 
exacerbated by inadequate description of the model and lack of tables with 
estimates of all model parameters and their standard errors. Furthermore, 
convergence can be slow and difficult to monitor. 

For some applications where latent variable models are for some reason not 
considered appropriate, GEE or fixed effects methods may be useful. 

Unfortunately, there is a paucity of simulation studies comparing the 
performance of different estimation methods, assessing the effects of different 
‘factors’ such as cluster size, intraclass correlation, sample size etc. on 
performance in the systematic way proposed by Skrondal (2000). Useful overviews 
of different estimation methods for latent variable models are given in 
Breslow (2003), McCulloch and Searle (2001), Rabe-Hesketh et al. (2002) and 
Goldstein (2003). 
Appendix: Some software and references 

For each of the main methods discussed in this chapter, we now list some 
references as well as software implementing the method. Note that omission 
of software is not informative about its quality. We do not provide addresses 
or links to the software since such information is quickly out of date and can 
easily be found on the internet. 

• Closed form marginal likelihood (Section 6.2) 

— Useful references: Browne and Arminger (1995) for structural equation 
modeling with latent variables (other methods are also discussed). 

— Software: 

* AMOS (Arbuckle and Wothke, 1999, 2003) 

* EQS (Bentler, 1995) 

* LISREL (Jöreskog and Sörbom, 1994; Jöreskog et al., 2001) 

* MECOSA (Arminger et al., 1996) 

* MX (Neale et al., 2002) 

• Laplace approximation 

- Useful references: Tierney and Kadane (1986), Tanner (1996, Section 
3.2), and Raudenbush et al. (2000). 

- Software: HLM for sixth order Laplace in multilevel generalized linear 
mixed models (Raudenbush et al., 2001). 

• Gauss-Hermite quadrature 

- Useful references: Stroud and Secrest (1966), and Davis and Rabinowitz 
(1984). 

- Software: 

* aML for multilevel and multiprocess models (Lillard and Panis, 2000) 

* BILOG-MG for binary logistic item-response models (e.g. Du Toit, 
2003) 

* EGRET for two-level random intercept models (EGRET for Windows 
User Manual, 2000) 

* gllamm for generalized linear latent and mixed models in Stata 
(Rabe-Hesketh et al., 2001b, 2004c) 

* LIMDEP for two-level random intercept models (Greene, 2002a) 

* MIXNO for two-level multinomial logit models (Hedeker, 1999) 

* MIXOR for two-level ordinal logistic and probit regression (Hedeker 
and Gibbons, 1996a) 

* MIXREG for two-level linear mixed models with autocorrelated errors 
(Hedeker and Gibbons, 1996b) 

* MIXSUR and MIXPREG for counts and discrete-time durations (by 
Hedeker) 

* MULTILOG for multinomial logit item-response models (e.g. Du Toit, 
2003) 
* PARSCALE for ordinal probit and logit item response models (e.g. 
Du Toit, 2003) 

* SABRE for two-level generalized linear mixed models (Francis et 
al., 1996) 

* Stata’s xt commands for two-level random intercept models (Stata 
Cross-Sectional Time-Series, 2003; StataCorp, 2003) 

* TESTFACT for exploratory multidimensional probit factor models 
(Bock et al., 1999; Du Toit, 2003) 



• Adaptive quadrature 

- Useful references: Pinheiro and Bates (1995), Bock and Schilling (1997), 
Evans and Swartz (2000), and Rabe-Hesketh et al. (2002, 2004b). 

— Software: 

* gllamm for generalized linear latent and mixed models in Stata 
(Rabe-Hesketh et al., 2001b, 2002, 2004c) 

* NLMIXED for two-level generalized linear mixed models in SAS 
(Wolfinger, 1999) 

* TESTFACT for exploratory multidimensional probit factor models 
(Bock et al., 1999; Du Toit, 2003) 

• Monte Carlo integration: 

- Useful references: Train (2003), Cappellari and Jenkins (2003), and 
Gouriéroux and Monfort (1996). 

— Software: 

* mvprobit for multivariate probit regression in Stata (Cappellari and 
Jenkins, 2003) 

* NLOGIT for multinomial logit and probit and other discrete choice 
models with random effects (Greene, 2002b) 

* DCM in Ox (Eklof and Weeks, 2003) 

* Mixed logit estimation routine for panel data in GAUSS (Train et al., 
1999) 

• EM algorithm: 

- Useful references: Tanner (1996), Schafer (1997), McLachlan and 
Krishnan (1997) and Little and Rubin (2002). 

- Software (some examples): 

* HLM for multilevel generalized linear mixed models (Raudenbush et 
al., 2001) 

* Latent GOLD for latent class and related models with many different 
response types (Vermunt and Magidson, 2000, 2003a) 

• Gradient methods 

- Useful references: Judge et al. (1985), Fletcher (1987), Gould et al. (2003), 
Thisted (1987), and Everitt (1987). 

- Software for maximizing arbitrary likelihood: 

* Stata’s ml command (Gould et al., 2003) 

* GAUSS’s add-on application CML (Schoenberg, 1996) 

• Limited information: 

- Useful references: Skrondal (1996), Küsters (1987) and Muthén and 
Satorra (1995). 

- Software: 

* Mplus (Muthén and Muthén, 1998, 2003) 

* MECOSA (Arminger et al., 1996) 

• (Marginal and penalized) quasi-likelihood: 

- Useful references: Goldstein (2003) and Breslow (2003). 

- Software: 

* HLM for multilevel generalized linear mixed models (Raudenbush et 
al., 2001) 

* MLwiN for generalized linear mixed models (Rasbash et al., 2000) 

* GLIMMIX for generalized linear mixed models in SAS (SAS/Stat 
User’s Guide, version 8, 2000) 

* GLMM for generalized linear mixed models in Genstat (Payne, 2002) 

* glmmPQL for generalized linear mixed models in S and R (Venables 
and Ripley, 2002) 

• Hierarchical-likelihood (h-likelihood): 

- Useful references: Lee and Nelder (1996, 2001) 

- Software: 

* HG procedures in Genstat (Payne, 2002) 

• Generalized Estimating Equations: 

- Useful references: Pickles (1998), Hardin and Hilbe (2002), and 
Molenberghs (2002). 

— Software: 

* Stata’s xtgee command (Stata Cross-Sectional Time-Series, 2003) 

* GENMOD in SAS (SAS/Stat User’s Guide, version 8, 2000) 

* GEE in Genstat (Payne, 2002) 

• Joint maximum likelihood 

- Useful references: Hambleton and Swaminathan (1985) for IRT; 
Lancaster (2000) on the incidental parameter problem. 

• Conditional maximum likelihood: 

— Useful references: Clayton and Hills (1993), Breslow and Day (1980), 
Hamerle and Ronning (1995) and Cameron and Trivedi (1998). 
- Software: 

* EGRET (EGRET for Windows User Manual, 2000) 

* Stata’s clogit, xtpoisson, etc. commands (Stata Cross-Sectional 
Time-Series, 2003) 

* LOGISTIC in SAS (SAS/Stat User’s Guide, version 8, 2000) 

• Bayes: 

- Useful references: Gelman et al. (2003) on ‘pragmatic’ Bayesian 
statistics; Casella and George (1992) on the Gibbs sampler; Chib and 
Greenberg (1995) on Metropolis-Hastings; Rubin (1991), Gelman and Rubin 
(1996), Gilks et al. (1996), and Gilks (1998) on different MCMC 
methods; Zeger and Karim (1991), Albert (1992), Albert and Chib (1993), 
Dellaportas and Smith (1993), Clayton (1996b), Arminger and Muthén 
(1998) and Fox and Glas (2001) on MCMC for generalized linear mixed 
models, structural equation models and IRT models. 

- Software: 

* BUGS and WinBUGS for general Bayesian and hierarchical Bayesian 
models (Spiegelhalter et al., 1996abc; see also Congdon, 2001) 

* GLMMGibbs for generalized linear mixed models by Gibbs sampling 
in R (Myles and Clayton, 2001) 

* MLwiN for generalized linear mixed models and multilevel factor 
models (Rasbash et al., 2000; see also Goldstein and Browne, 2002) 

* Routine for mixed logits with bounded distributions in GAUSS (Train, 
2002) 
CHAPTER 7 


Assigning values to latent variables 


7.1 Introduction 

In this chapter we discuss methods for assigning values to latent variables for 
individual clusters. The clusters could for instance be subjects in measurement 
and longitudinal modeling or schools in multilevel modeling. For continuous 
latent variables we will refer to this as latent scoring (factor scoring or random 
effects scoring) and for discrete latent variables as classification. Note that 
this terminology does not distinguish between different response types; factor 
scoring would for instance include scoring for IRT models with dichotomous 
responses. 

Sometimes scoring and classification are the main aims of latent variable 
modeling, canonical examples being ability scoring based on IRT models and 
medical diagnosis based on latent class models. Other examples include disease 
mapping, small area estimation and assessments of institutional performance. 

In the previous chapter we considered estimation of the fundamental 
parameters $\boldsymbol{\vartheta}$. Here we assume that these parameters have been estimated as 
$\hat{\boldsymbol{\vartheta}}$, yielding the structural parameter estimates $\hat{\boldsymbol{\theta}} = h(\hat{\boldsymbol{\vartheta}})$. Sometimes, for 
instance in educational testing, the parameters are estimated using data from 
a large calibration sample which does not include the clusters to be scored. 
An advantage of this approach is that the parameter estimates are very 
precise. However, transporting the estimates across different populations can be 
problematic. 

Unlike the previous chapter, we focus exclusively on frequentist methods, 
treating the estimated structural parameters as known. In the Bayesian 
approach both parameters and latent variables are treated as random variables, 
so there is no fundamental distinction between parameter estimation and 
latent scoring or classification. In the empirical Bayesian approach, inference 
regarding the latent variables is based on the conditional posterior 
distribution given the parameters (with estimates plugged in). The empirical Bayesian 
posterior distribution is discussed in Section 7.2. 

We consider three methods of assigning values to latent variables that are 
motivated from general statistical principles. Prediction using empirical Bayes 
(EB) (also called expected a posteriori, EAP) is discussed in Section 7.3 and 
prediction using empirical Bayes modal (EBM) (also called modal a 
posteriori, MAP) in Section 7.4. Estimation using maximum likelihood (ML) is 
treated in Section 7.5. Empirical Bayes is the most common approach for 
latent scoring, whereas empirical Bayes modal is the most common approach 
for classification. 

For each of the scoring methods different notions of variability of latent 
scores are contrasted. This is especially useful for empirical Bayes prediction 
where there is confusion in the literature regarding the meaning of different 
types of variances. 

As well as discussing scoring and classification for the general model, we 
also investigate the special case of models without a structural part (e.g. factor 
models and random effects models) with multivariate normal latent variables 
and responses, henceforth for brevity referred to as the ‘linear case’. It turns 
out that many familiar scoring methods can be recognized as special cases of 
the general approaches, providing a deeper motivation for and 
understanding of these methods. The closed form expressions for the ‘linear case’ are 
also helpful for discussing concepts such as ‘shrinkage’. To aid interpretation, 
the expressions are also presented for the special case of a two-level random 
intercept model. For more complex linear models, we find it instructive to 
substitute numerical values into the formulae in order to ‘see’ what happens 
in concrete examples. In Section 7.6 we demonstrate how the three scoring 
methods are related in the ‘linear case’. 

Since ad hoc methods are often used for latent scoring of hypothetical 
constructs, such approaches are briefly discussed in Section 7.7. In Section 7.8 we 
explore the wide range of uses of latent scoring and classification. A list of 
software implementing different methods for assigning values to latent variables 
is provided in an appendix. 

Our discussion is confined to continuous latent variables unless otherwise 
indicated. We also confine our explicit treatment to disturbances or residuals 
$\boldsymbol{\zeta}$ (with zero means) instead of the $\boldsymbol{\eta}$, which in addition to the disturbances 
may be composed of regressions on observed covariates as well as on other 
latent variables. This is to simplify notation and because the disturbances are 
often of interest in their own right. Since the structural equations are linear, 
scores for the corresponding $\boldsymbol{\eta}$ can be obtained by substituting the scores for 
the disturbances into the reduced form for the latent variables (4.18). If there 
is no structural model, we simply have $\boldsymbol{\zeta} = \boldsymbol{\eta}$. 


7.2 Posterior distributions 

Frequentists often turn to Bayesian principles when assigning values to latent 
variables. The reason for this is that the latent variables can be interpreted as 
random ‘parameters’ (e.g. random effects) with a ‘prior’ distribution $h(\boldsymbol{\zeta}; \boldsymbol{\theta})$, 
making the models appear similar to Bayesian models. An important 
difference is that the fully Bayesian approach discussed in Section 6.11 would also 
assume a prior for the structural parameters $\boldsymbol{\theta}$ in addition to the prior for the 
disturbances $\boldsymbol{\zeta}$. In this case the priors for the parameters of the prior for $\boldsymbol{\zeta}$, 
e.g. the variances and covariances of the latent variables, would be referred to 
as hyperpriors. A Bayesian would base parameter estimation as well as latent 
scoring on posterior distributions given the responses $\mathbf{y}$. The relevant 
posterior for scoring would be marginal with respect to $\boldsymbol{\theta}$, whereas the posterior 
for parameter estimation would be marginal with respect to $\boldsymbol{\zeta}$. 

© 2004 by Chapman & Hall/CRC 

When it comes to latent scoring, frequentists typically adopt an empirical 
Bayesian approach. They estimate the parameters $\boldsymbol{\theta}$ by maximum likelihood 
(or another method) but rely on the conditional posterior distribution of the 
latent variables, given the estimated parameters $\hat{\boldsymbol{\theta}}$, for prediction. For 
convenience we will in the sequel use Bayesian terminology, which would be 
technically correct if $\hat{\boldsymbol{\theta}}$ were not estimated model parameters but fixed constants. 

We have three different sources of information concerning the disturbances 
$\boldsymbol{\zeta}$. The first piece of information is the prior distribution $h(\boldsymbol{\zeta}; \hat{\boldsymbol{\theta}})$ representing 
our a priori knowledge about the latent variables, typically specified as 
multivariate normal when they are continuous. The second piece of information 
is provided by the observed responses $\mathbf{y}$ ‘measuring’ the latent variables. The 
third piece of information, which may not always be available, is the covariates 
$\mathbf{X}$ in the response and/or structural models. 

Note that it is not always clear whether one should use covariate 
information in models for latent scoring and classification. For instance, most people 
would agree that covariates such as age, gender and ethnicity should be used in 
diagnosis of heart disease if this reduces the risk of misdiagnosis. In contrast, 
use of such covariate information might be considered ‘politically incorrect’ or 
unfair in educational testing, even if it improves the quality of ability 
assessment (see also Section 9.4). 

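A natural way of combining the sources of information is through the 
posterior distribution $\omega(\boldsymbol{\zeta} \mid \mathbf{y}, \mathbf{X}; \hat{\boldsymbol{\theta}})$ of $\boldsymbol{\zeta}$, the distribution of $\boldsymbol{\zeta}$ updated with or given 
the data $\mathbf{y}$ and $\mathbf{X}$. Thus the posterior provides a natural setting for inference 
concerning latent scores (see also Bartholomew, 1981). Using Bayes theorem, 
we obtain 

$$
\omega(\boldsymbol{\zeta} \mid \mathbf{y}, \mathbf{X}; \hat{\boldsymbol{\theta}}) = \frac{\Pr(\mathbf{y}, \boldsymbol{\zeta} \mid \mathbf{X}; \hat{\boldsymbol{\theta}})}{\Pr(\mathbf{y} \mid \mathbf{X}; \hat{\boldsymbol{\theta}})}.
$$

In the general L-level model, the linear predictors of each level-1 unit in a 
top-level cluster $z$ will generally depend on several elements of the vector of 
latent variables for that cluster, $\boldsymbol{\zeta}_{z(L)}$. For example, in a three-level random 
intercept model, as shown in Display 3.2 on page 59, each response depends 
on both a level-2 and level-3 random intercept. It follows that a given 
response provides information on more than a single latent variable so that the 
latent variables are generally dependent under the posterior distribution. It is 
therefore useful to consider the joint posterior distribution of $\boldsymbol{\zeta}_{z(L)}$, given all 
responses $\mathbf{y}_{z(L)}$ and all covariates $\mathbf{X}_{z(L)}$ for a top-level cluster $z$, 

$$
\omega(\boldsymbol{\zeta}_{z(L)} \mid \mathbf{y}_{z(L)}, \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}}) = \frac{\Pr(\mathbf{y}_{z(L)}, \boldsymbol{\zeta}_{z(L)} \mid \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}})}{\Pr(\mathbf{y}_{z(L)} \mid \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}})}.
$$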

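To fix ideas, consider a simple two-level random intercept model. For a given 
cluster $j$, the joint density of the random intercept $\zeta_j^{(2)}$ and the responses $\mathbf{y}_{j(2)}$ 
in the numerator above can be written as 

$$
h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}}).
$$

Here, we have utilized that the $y_{ij}$ are conditionally independent given $\zeta_j^{(2)}$. 
The marginal density of the responses in the denominator is just the integral 
of this joint density with respect to the random intercept so that the posterior 
density becomes 

$$
\omega(\zeta_j^{(2)} \mid \mathbf{y}_{j(2)}, \mathbf{X}_{j(2)}; \hat{\boldsymbol{\theta}}) = \frac{h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}})}{\int h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \, d\zeta_j^{(2)}}.
$$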

Observe that the posterior cannot in general be expressed in closed form 
because the integral in the denominator does not have an analytical solution. 
Since the denominator is just the likelihood contribution of the jth level-2 
unit, the problem is the same as that of evaluating the marginal likelihood 
discussed in Section 6.3. 
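
The role of the denominator can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: a random intercept logit model, a normal prior with known variance `psi`, and ordinary (non-adaptive) Gauss-Hermite quadrature for the marginal likelihood. The function name `posterior_density` and its arguments are hypothetical and do not correspond to any package listed in the appendix.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def posterior_density(zeta, y, eta_fixed, psi, n_quad=30):
    """Posterior density of a normal random intercept in a logit model.

    zeta      : points at which the density is evaluated
    y         : 0/1 responses of one cluster
    eta_fixed : fixed part x_ij' beta of the linear predictor, one per unit
    psi       : random intercept variance, treated as known
    """
    zeta = np.atleast_1d(np.asarray(zeta, dtype=float))
    y = np.asarray(y, dtype=float)
    eta_fixed = np.asarray(eta_fixed, dtype=float)

    def cond_lik(z):
        # Product over level-1 units of g(y_ij | x_ij, zeta) for each z
        p = 1.0 / (1.0 + np.exp(-(eta_fixed[None, :] + z[:, None])))
        return np.prod(np.where(y[None, :] == 1.0, p, 1.0 - p), axis=1)

    # Numerator: prior density times conditional likelihood
    prior = np.exp(-0.5 * zeta**2 / psi) / np.sqrt(2.0 * np.pi * psi)
    numerator = prior * cond_lik(zeta)

    # Denominator: marginal likelihood of the cluster by Gauss-Hermite
    # quadrature, using the substitution zeta = sqrt(2 * psi) * node
    nodes, weights = hermgauss(n_quad)
    marginal = np.sum(weights * cond_lik(np.sqrt(2.0 * psi) * nodes)) / np.sqrt(np.pi)
    return numerator / marginal


# Hypothetical cluster with four binary responses
grid = np.linspace(-10.0, 10.0, 4001)
dens = posterior_density(grid, [1, 1, 0, 1], [0.2, -0.1, 0.3, 0.0], psi=1.5)
area = np.sum(dens) * (grid[1] - grid[0])  # should be close to one
```

If the quadrature approximation of the denominator is adequate, the density integrates to approximately one over the grid; a poor approximation (too few points, or a sharply peaked integrand) shows up as an area that departs from one.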

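Writing the general L-level model as a two-level model in terms of $\boldsymbol{\nu}_{z(L)}$ and 
$\boldsymbol{\zeta}_{z(L)}$ as shown in Section 4.2.2, the expression above also applies in the L-level 
case, 

$$
\omega(\boldsymbol{\zeta}_{z(L)} \mid \mathbf{y}_{z(L)}, \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}}) = \frac{h(\boldsymbol{\zeta}_{z(L)}; \hat{\boldsymbol{\theta}}) \prod g^{(1)}(y_{iz} \mid \mathbf{x}_{iz}, \boldsymbol{\zeta}_{z(L)}; \hat{\boldsymbol{\theta}})}{\int h(\boldsymbol{\zeta}_{z(L)}; \hat{\boldsymbol{\theta}}) \prod g^{(1)}(y_{iz} \mid \mathbf{x}_{iz}, \boldsymbol{\zeta}_{z(L)}; \hat{\boldsymbol{\theta}}) \, d\boldsymbol{\zeta}_{z(L)}},
$$

where the products are now over all level-1 units within the zth level-L unit. 
However, the integral in the denominator has a very high dimensionality and 
numerical integration should make use of the conditional independence 
structure to evaluate the integral recursively as in equation (6.2) on page 162. 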

In the ‘linear case’, it follows from standard results on conditional 
multivariate normal densities (e.g. Anderson, 2003) that the posterior density is 
multivariate normal (see equations (7.3) and (7.7) for the expectation vector 
and covariance matrix). For other response types, it follows from the Bayesian 
central limit theorem (e.g. Carlin and Louis, 1998) that the posterior density 
tends to multinormality as the number of units in the cluster increases. 

Finally, consider a discrete random intercept model with prior probabilities 
$\pi_{cj}$ for locations $\zeta_j^{(2)} = e_c$, $c = 1, \ldots, C$. The posterior probability that the 
intercept equals $e_c$ is given by 

$$
\omega(e_c \mid \mathbf{y}_{j(2)}, \mathbf{X}_{j(2)}; \hat{\boldsymbol{\theta}}) = \frac{\pi_{cj} \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)} = e_c; \hat{\boldsymbol{\theta}})}{\sum_{c'=1}^{C} \pi_{c'j} \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)} = e_{c'}; \hat{\boldsymbol{\theta}})}.
$$

In medical diagnosis using a single test result $y_j$, these probabilities are called 
the ‘positive predictive value’ if $y_j = 1$ (positive test result) and $c = 2$ (disease 
present) and the ‘negative predictive value’ if $y_j = 0$ (negative test result) and 
$c = 1$ (disease absent). In this context the prior probability is the prevalence 
of disease which represents the physician’s knowledge of the patient diagnosis 
before seeing the test result. If the test is useful, the posterior probabilities 
will be substantially closer to zero or one than the prior probabilities. See 
Section 9.3 for an application to diagnosis of myocardial infarction (heart 
attack). 
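
The diagnostic interpretation can be illustrated with a few lines of arithmetic. The sketch below applies Bayes theorem to a two-class model with a single binary test result; the prevalence, sensitivity and specificity are made-up values chosen for illustration, and the helper `posterior_class_probs` is a hypothetical name.

```python
def posterior_class_probs(prior, cond_probs):
    """Posterior class probabilities given one observed response.

    prior      : prior probabilities pi_c of the latent classes
    cond_probs : Pr(observed y | class c), in the same order
    """
    joint = [p * g for p, g in zip(prior, cond_probs)]
    total = sum(joint)  # marginal probability of the observed response
    return [j / total for j in joint]


# Hypothetical values: 10% prevalence, 90% sensitivity, 80% specificity
prevalence, sensitivity, specificity = 0.10, 0.90, 0.80
prior = [1.0 - prevalence, prevalence]  # c = 1: disease absent, c = 2: present

# Positive test (y = 1): Pr(y = 1 | absent) = 1 - specificity
ppv = posterior_class_probs(prior, [1.0 - specificity, sensitivity])[1]

# Negative test (y = 0): Pr(y = 0 | present) = 1 - sensitivity
npv = posterior_class_probs(prior, [specificity, 1.0 - sensitivity])[0]
```

With these illustrative numbers the positive predictive value is 1/3 and the negative predictive value is about 0.986. A positive result moves the probability of disease from the prior 0.10 to 0.33, while a negative result moves the probability of no disease from 0.90 to 0.986.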
7.3 Empirical Bayes (EB) 

7.3.1 Empirical Bayes prediction 

Empirical Bayes prediction is undoubtedly the most widely used method for 
both factor and random effects scoring. Empirical Bayes predictors (see Efron 
and Morris, 1973, 1975; Morris, 1983) of the latent variables $\boldsymbol{\zeta}_{z(L)}$ are their 
posterior means with parameter estimates $\hat{\boldsymbol{\theta}}$ plugged in, 

$$
\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{z(L)} = \mathrm{E}(\boldsymbol{\zeta}_{z(L)} \mid \mathbf{y}_{z(L)}, \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}}).
$$

Whenever the prior distribution is parametric, the predictor is denoted 
‘parametric empirical Bayes’. 

The reason for the term ‘empirical Bayes’ is that, as noted earlier, Bayesian 
principles are adapted to a frequentist setting by plugging in estimated model 
parameters. The Bayesian would obtain the posterior distribution of the latent 
variables, marginal to $\boldsymbol{\theta}$, instead of simply plugging in estimates for $\boldsymbol{\theta}$. It is 
evident that the Bayes and empirical Bayes approaches are based on different 
philosophical underpinnings, prompting Lindley (1969) to remark: 

“there is no one less Bayesian than an empirical Bayesian.” 

Unfortunately, ambiguity still prevails in the terminology for Bayes and 
empirical Bayes prediction. This is reflected in the term ‘expected a posteriori’ 
predictor (EAP) predominant in the psychometric literature (e.g. Bock and 
Aitkin, 1981; Bock and Mislevy, 1982; Bock, 1983, 1985; Muraki and 
Engelhard Jr., 1985) which seems to imply true Bayes prediction. 

Despite the theoretical differences between Bayes and empirical Bayes 
inference, whenever $\hat{\boldsymbol{\theta}}$ is consistent, the effect of substituting estimates for 
parameters is expected to be small if the likelihood dominates the (hyper)prior 
of $\boldsymbol{\theta}$, as with large samples and/or vague (hyper)priors. Little is known, however, 
about the consequences in the small or moderate sample situation. A 
reassuring theoretical result is provided by Deely and Lindley (1981) who point out 
that the empirical Bayes predictor is a first order approximation to the Bayes 
predictor. 

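The empirical Bayes predictor can be justified by considering the summed 
quadratic loss function defined as the unweighted sum of the squared errors 
of a predictor $\tilde{\boldsymbol{\zeta}}_{(L)}$. Dropping the $z$ subscript, 

$$
l^{\mathrm{EB}}(\tilde{\boldsymbol{\zeta}}_{(L)}, \boldsymbol{\zeta}_{(L)}) = (\tilde{\boldsymbol{\zeta}}_{(L)} - \boldsymbol{\zeta}_{(L)})' (\tilde{\boldsymbol{\zeta}}_{(L)} - \boldsymbol{\zeta}_{(L)}).
$$

Treating the parameters as known, the empirical Bayes predictor minimizes 
the expected posterior loss: 

$$
\int (\tilde{\boldsymbol{\zeta}}_{(L)} - \boldsymbol{\zeta}_{(L)})' (\tilde{\boldsymbol{\zeta}}_{(L)} - \boldsymbol{\zeta}_{(L)}) \, \omega(\boldsymbol{\zeta}_{(L)} \mid \mathbf{y}_{(L)}, \mathbf{X}_{(L)}; \hat{\boldsymbol{\theta}}) \, d\boldsymbol{\zeta}_{(L)}. \qquad (7.2)
$$

Note that this can be given the intuitively pleasing interpretation as 
proportional to a posterior mean squared error of prediction, where the expectation 
is taken over the posterior distribution. 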
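
The minimization claim can be checked in a few lines. The following sketch is a standard argument rather than a derivation taken from the sources cited here; it writes $\mathbf{m}$ for the posterior mean and suppresses the conditioning on the covariates and the parameter estimates, both of which are notational assumptions made for brevity.

```latex
% Let m = E(zeta | y) be the posterior mean and zeta-tilde any predictor.
\begin{align*}
\mathrm{E}\!\left[(\tilde{\boldsymbol{\zeta}} - \boldsymbol{\zeta})'(\tilde{\boldsymbol{\zeta}} - \boldsymbol{\zeta}) \mid \mathbf{y}\right]
  &= \mathrm{E}\!\left[\{(\tilde{\boldsymbol{\zeta}} - \mathbf{m}) + (\mathbf{m} - \boldsymbol{\zeta})\}'
       \{(\tilde{\boldsymbol{\zeta}} - \mathbf{m}) + (\mathbf{m} - \boldsymbol{\zeta})\} \mid \mathbf{y}\right] \\
  &= (\tilde{\boldsymbol{\zeta}} - \mathbf{m})'(\tilde{\boldsymbol{\zeta}} - \mathbf{m})
     + 2(\tilde{\boldsymbol{\zeta}} - \mathbf{m})'\,\mathrm{E}(\mathbf{m} - \boldsymbol{\zeta} \mid \mathbf{y})
     + \mathrm{E}\!\left[(\mathbf{m} - \boldsymbol{\zeta})'(\mathbf{m} - \boldsymbol{\zeta}) \mid \mathbf{y}\right] \\
  &= (\tilde{\boldsymbol{\zeta}} - \mathbf{m})'(\tilde{\boldsymbol{\zeta}} - \mathbf{m})
     + \mathrm{tr}\,\mathrm{Cov}(\boldsymbol{\zeta} \mid \mathbf{y}).
\end{align*}
```

The cross term vanishes because $\mathrm{E}(\mathbf{m} - \boldsymbol{\zeta} \mid \mathbf{y}) = \mathbf{0}$. The first term is nonnegative and equals zero only when $\tilde{\boldsymbol{\zeta}} = \mathbf{m}$, so the posterior mean minimizes the expected posterior loss, and the minimum equals the trace of the posterior covariance matrix.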
Searle et al. (1992, p.262) show that the empirical Bayes predictor also 
minimizes the mean squared error of prediction over the sampling distribution 
of y if the parameters are treated as known. For frequentists, this result might 
be more useful than the (empirical) Bayesian justification in terms of posterior 
loss. McCulloch and Searle (2001, p.257-258) emphasize that substitution of 
estimated parameters in the empirical Bayes predictor is purely pragmatic 
and has no statistical rationale. 

It is clear that the mean squared error loss function is meaningless for truly 
discrete latent variables where predictions must coincide with the locations of 
the latent classes. Empirical Bayes prediction should therefore not be used in 
this case. However, empirical Bayes prediction can be used in nonparametric 
maximum likelihood estimation (NPMLE), where the discrete distribution is 
interpreted as a nonparametric estimator of a possibly continuous latent 
variable distribution. Empirical Bayes prediction based on NPMLE has been used 
by Clayton and Kaldor (1987), Laird (1982) and Rabe-Hesketh et al. (2003a), 
among others. The latter paper found that EB predictions based on NPMLE 
outperformed predictions based on normality for a skewed latent variable 
distribution. See Section 11.4 for an application of empirical Bayes prediction 
based on NPMLE for disease mapping. 

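For a two-level random intercept model with random intercept $\eta_j^{(2)} = \zeta_j^{(2)}$, 

$$
\nu_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j^{(2)},
$$

the empirical Bayes predictor becomes 

$$
\tilde{\zeta}_j^{(2)\mathrm{EB}} = \mathrm{E}(\zeta_j^{(2)} \mid \mathbf{y}_{j(2)}, \mathbf{X}_{j(2)}; \hat{\boldsymbol{\theta}}) = \frac{\int \zeta_j^{(2)} \, h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \, d\zeta_j^{(2)}}{\int h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}}) \, d\zeta_j^{(2)}}.
$$

In the general L-level model, the empirical Bayes prediction of the latent 
variable $\zeta^{(l)}$ at level $l$ can be obtained recursively as 

$$
\mathrm{E}(\zeta^{(l)} \mid \boldsymbol{\zeta}^{(l+1)+}) = \frac{\int \zeta^{(l)} \, h(\zeta^{(l)}) \prod g^{(l-1)} \, d\zeta^{(l)}}{\int h(\zeta^{(l)}) \prod g^{(l-1)} \, d\zeta^{(l)}}
$$

and 

$$
\mathrm{E}(\zeta^{(l)} \mid \boldsymbol{\zeta}^{(l+k+1)+}) = \frac{\int \mathrm{E}(\zeta^{(l)} \mid \boldsymbol{\zeta}^{(l+k)+}) \, h(\zeta^{(l+k)}) \prod g^{(l+k-1)} \, d\zeta^{(l+k)}}{\int h(\zeta^{(l+k)}) \prod g^{(l+k-1)} \, d\zeta^{(l+k)}},
$$

where $k = 1, \ldots, L-l$, the conditioning on the responses, covariates and parameter 
estimates is suppressed in the expectations, and we have written $g^{(l)}$ for 
$g^{(l)}(\mathbf{y}_{(l)} \mid \mathbf{X}_{(l)}, \boldsymbol{\zeta}^{(l+1)+}; \hat{\boldsymbol{\theta}})$ to 
simplify notation. In general, it is impossible to obtain empirical Bayes 
predictions using analytical integration. Any of the numerical integration methods 
discussed in Section 6.3 can be used, see for instance page 167 for adaptive 
quadrature. For discrete distributions, the integrals are simply replaced by 
sums. 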
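
The remark that integrals are replaced by sums suggests a single routine for both cases. The sketch below is an illustration, not the algorithm of any particular package: it computes the posterior mean of a scalar latent variable on a set of locations with masses, where the locations are either Gauss-Hermite nodes (approximating a normal prior) or the mass points of a discrete distribution. All function names are hypothetical, and a linear model with known level-1 variance is assumed so that the result can be compared with the closed form.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def eb_predict(locations, masses, cond_lik):
    """Posterior mean of a scalar latent variable over weighted locations."""
    locations = np.asarray(locations, dtype=float)
    w = np.asarray(masses, dtype=float) * cond_lik(locations)
    return np.sum(locations * w) / np.sum(w)


def gauss_hermite_prior(psi, n_quad=30):
    """Locations and masses approximating a N(0, psi) prior."""
    nodes, weights = hermgauss(n_quad)
    return np.sqrt(2.0 * psi) * nodes, weights / np.sqrt(np.pi)


def normal_cond_lik(residuals, theta):
    """Conditional likelihood (up to a constant) of y_ij - x_ij' beta."""
    r = np.asarray(residuals, dtype=float)

    def lik(z):
        z = np.asarray(z, dtype=float)
        sq = np.sum((r[None, :] - z[:, None]) ** 2, axis=1)
        return np.exp(-0.5 * sq / theta)

    return lik


# Hypothetical cluster: two raw residuals, psi = 1, theta = 1
lik = normal_cond_lik([1.0, 0.2], theta=1.0)

# Continuous (normal) prior approximated by quadrature
loc, mass = gauss_hermite_prior(psi=1.0)
eb_normal = eb_predict(loc, mass, lik)

# Discrete prior with two equally likely mass points: a plain sum
eb_discrete = eb_predict([-1.0, 1.0], [0.5, 0.5], lik)
```

For the normal prior the quadrature result can be checked against the closed-form linear-case predictor, which here equals (2/3) × 0.6 = 0.4. For the two-point prior the posterior mean necessarily lies between the two locations.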








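The ‘linear case’: We now consider the ‘linear case’ with no structural model, 
i.e. $\boldsymbol{\eta}_{z(L)} = \boldsymbol{\zeta}_{z(L)}$, specified as in (4.3), 

$$
\mathbf{y}_{z(L)} = \mathbf{X}_{z(L)}\boldsymbol{\beta} + \boldsymbol{\Lambda}_{z(L)}\boldsymbol{\zeta}_{z(L)} + \boldsymbol{\epsilon}_{z(L)}.
$$

Here, the latent variables $\boldsymbol{\zeta}_{z(L)}$ are multivariate normal with estimated 
covariance matrix $\hat{\boldsymbol{\Psi}}_{(L)}$ and the disturbances $\boldsymbol{\epsilon}_{z(L)}$ are multivariate normal with 
estimated diagonal covariance matrix $\hat{\boldsymbol{\Theta}}_{z(L)}$ with elements $\hat{\theta}_{ii}$ (not to be 
confused with the vector $\hat{\boldsymbol{\theta}}$ of all estimated parameters). 

The empirical Bayes predictor can in this case be expressed as 

$$
\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{z(L)} = \hat{\boldsymbol{\Psi}}_{(L)} \hat{\boldsymbol{\Lambda}}_{z(L)}' \hat{\boldsymbol{\Sigma}}_{z(L)}^{-1} (\mathbf{y}_{z(L)} - \mathbf{X}_{z(L)}\hat{\boldsymbol{\beta}}), \qquad (7.3)
$$

where 

$$
\hat{\boldsymbol{\Sigma}}_{z(L)} = \hat{\boldsymbol{\Lambda}}_{z(L)} \hat{\boldsymbol{\Psi}}_{(L)} \hat{\boldsymbol{\Lambda}}_{z(L)}' + \hat{\boldsymbol{\Theta}}_{z(L)},
$$

the estimated residual covariance structure of $\mathbf{y}_{z(L)}$. For factor models the 
term $\hat{\boldsymbol{\Psi}}_{(L)} \hat{\boldsymbol{\Lambda}}_{z(L)}' \hat{\boldsymbol{\Sigma}}_{z(L)}^{-1}$ in (7.3) is often called the factor scoring matrix for the 
regression method. 

The unconditional expectation (over $\mathbf{y}_{z(L)}$) becomes 

$$
\mathrm{E}_y(\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{z(L)} \mid \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}}) = \mathbf{0},
$$

because the expectation of the term in brackets in (7.3) is zero. If we condition 
on the true realized latent variables $\boldsymbol{\zeta}_{z(L)}$, this is no longer the case with 

$$
\mathrm{E}_y(\mathbf{y}_{z(L)} - \mathbf{X}_{z(L)}\hat{\boldsymbol{\beta}} \mid \boldsymbol{\zeta}_{z(L)}, \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}}) = \hat{\boldsymbol{\Lambda}}_{z(L)} \boldsymbol{\zeta}_{z(L)}.
$$

The conditional expectation given $\boldsymbol{\zeta}_{z(L)}$ therefore is 

$$
\begin{aligned}
\mathrm{E}_y(\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{z(L)} \mid \boldsymbol{\zeta}_{z(L)}, \mathbf{X}_{z(L)}; \hat{\boldsymbol{\theta}})
 &= (\hat{\boldsymbol{\Psi}}_{(L)} \hat{\boldsymbol{\Lambda}}_{z(L)}' \hat{\boldsymbol{\Sigma}}_{z(L)}^{-1} \hat{\boldsymbol{\Lambda}}_{z(L)}) \, \boldsymbol{\zeta}_{z(L)} \\
 &= \boldsymbol{\zeta}_{z(L)} - (\mathbf{I} + \hat{\boldsymbol{\Psi}}_{(L)} \hat{\boldsymbol{\Omega}}_{z(L)})^{-1} \boldsymbol{\zeta}_{z(L)}, \qquad (7.4)
\end{aligned}
$$

where 

$$
\hat{\boldsymbol{\Omega}}_{z(L)} = \hat{\boldsymbol{\Lambda}}_{z(L)}' \hat{\boldsymbol{\Theta}}_{z(L)}^{-1} \hat{\boldsymbol{\Lambda}}_{z(L)}.
$$

Hence, the empirical Bayes predictor is unconditionally unbiased but 
conditionally biased since the last term in (7.4) does not in general equal a zero 
vector. In the ‘linear case’, the empirical Bayes predictor is the ‘Best Linear 
Unbiased Predictor’ BLUP (Goldberger, 1962; Robinson, 1991) since it is 
linear in $\mathbf{y}_{(L)}$, unconditionally unbiased and best in the sense that it minimizes 
the marginal sampling variance of the prediction error, if the parameters are 
treated as known (see also Lawley and Maxwell, 1971). Note that the BLUP 
concept is more general than parametric empirical Bayes in the sense that it 
does not rely on distributional assumptions (e.g. McCulloch and Searle, 2001, 
p.256). 

It turns out that many results from the statistical and psychometric 
literature can be derived as special cases of the above formulae. For the conventional 
common factor model (see Sections 3.3.2 and 3.3.3), the empirical Bayes 
predictor is just the regression method for factor scoring discussed by Thomson 
(1938) and Thurstone (1935) in their seminal treatments of factor analysis. 
Interestingly, the predictor proposed by Spearman (1927) for his one-factor 
model is a special case of the above, and consequently an early application of 
empirical Bayes methodology. 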

For random effects models we obtain the results reported by for instance 
Rao (1975) and Strenio et al. (1983). For example, in a two-level random 
intercept model with homoscedastic level-1 variances $\hat{\theta}_{ii} = \hat{\theta}$, the empirical 
Bayes predictor reduces to 

$$
\tilde{\zeta}_j^{(2)\mathrm{EB}} = \underbrace{\frac{\hat{\psi}}{\hat{\psi} + \hat{\theta}/n_j}}_{\hat{R}_{j(2)}} \left[ \frac{1}{n_j} \sum_{i=1}^{n_j} (y_{ij} - \mathbf{x}_{ij}'\hat{\boldsymbol{\beta}}) \right]. \qquad (7.5)
$$

The term in brackets is the mean ‘raw’ or total residual for cluster $j$ and 
$\hat{R}_{j(2)}$ is a shrinkage factor which pulls the empirical Bayes prediction towards 
zero, the mean of the prior distribution. The shrinkage factor can be 
interpreted as the estimated reliability of the mean raw residual as a ‘measurement’ 
of $\zeta_j^{(2)}$ (the variance of the ‘true score’ over the total variance). The 
reliability is smallest when $n_j$ is small and when $\hat{\theta}$ is large compared with $\hat{\psi}$; the 
conditional density of the responses $\prod_i g^{(1)}(y_{ij} \mid \mathbf{x}_{ij}, \zeta_j^{(2)}; \hat{\boldsymbol{\theta}})$ then becomes flat 
and uninformative compared with the prior density $h(\zeta_j^{(2)}; \hat{\boldsymbol{\theta}})$. 
In empirical Bayes prediction the effect of the prior for small clusters is to 
pull the predictions $\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}} + \tilde{\zeta}_j^{(2)\mathrm{EB}}$ toward $\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}}$ (where all clusters contribute to 
the estimation of $\boldsymbol{\beta}$), often referred to as ‘borrowing strength’ from the other 
clusters. We will show in Section 7.6 how the concept of shrinkage also applies 
to empirical Bayes predictions for the general ‘linear case’. 
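
How strongly the prior pulls depends on the cluster size, which a short calculation makes visible. The sketch below evaluates the shrinkage factor of (7.5), random intercept variance over that variance plus the level-1 variance divided by the cluster size, for a few cluster sizes. The variance values and function names are hypothetical, chosen only for illustration.

```python
def shrinkage_factor(psi, theta, n_j):
    """Estimated reliability of the mean raw residual of a cluster of size n_j."""
    return psi / (psi + theta / n_j)


def eb_random_intercept(residuals, psi, theta):
    """Empirical Bayes prediction of a random intercept in the linear case.

    residuals : raw residuals y_ij - x_ij' beta for one cluster
    """
    n_j = len(residuals)
    mean_residual = sum(residuals) / n_j
    return shrinkage_factor(psi, theta, n_j) * mean_residual


# Hypothetical variance estimates: psi = 1 (level 2), theta = 1 (level 1)
factors = {n: shrinkage_factor(1.0, 1.0, n) for n in (1, 2, 5, 20)}
prediction = eb_random_intercept([1.0, 0.2], psi=1.0, theta=1.0)
```

With equal variances the factor is 0.5 for a single unit, about 0.67 for two units, about 0.83 for five and about 0.95 for twenty, so the prediction approaches the mean raw residual as the cluster grows. For the two residuals shown, the mean raw residual 0.6 is shrunk to 0.4.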

A concrete example of a three-level linear random intercept model is helpful, 
and we will repeatedly return to this example in the chapter. 

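Example: A level-3 unit contains two level-2 units with two level-1 units 
within each. As shown in Display 3.2 on page 59, the structure matrix 
$\boldsymbol{\Lambda}_{1(3)}$ (which appears under a different symbol in the display) and latent 
variable vector $\boldsymbol{\zeta}_{1(3)}$ are 

$$
\boldsymbol{\Lambda}_{1(3)} = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix}
\quad \text{and} \quad
\boldsymbol{\zeta}_{1(3)} = \begin{bmatrix} \zeta_{11}^{(2)} \\ \zeta_{21}^{(2)} \\ \zeta_{1}^{(3)} \end{bmatrix}.
$$

Assuming that the random effects covariance matrix was estimated as 

$$
\hat{\boldsymbol{\Psi}}_{(3)} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 2 \end{bmatrix}
$$

and the level-1 variance as $\hat{\theta} = 1$, the empirical Bayes predictions become 

$$
\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{1(3)} = \begin{bmatrix} 0.21 & 0.21 & -0.12 & -0.12 \\ -0.12 & -0.12 & 0.21 & 0.21 \\ 0.18 & 0.18 & 0.18 & 0.18 \end{bmatrix}
\begin{bmatrix} y_{11} - \mathbf{x}_{11}'\hat{\boldsymbol{\beta}} \\ y_{21} - \mathbf{x}_{21}'\hat{\boldsymbol{\beta}} \\ y_{31} - \mathbf{x}_{31}'\hat{\boldsymbol{\beta}} \\ y_{41} - \mathbf{x}_{41}'\hat{\boldsymbol{\beta}} \end{bmatrix},
$$

where the ‘weights’ have been rounded to two decimal places. Note that 
all responses for the same top-level unit provide information on all latent 
variables for that unit. The conditional expectations given the true realized 
latent variables are 

$$
\mathrm{E}_y(\tilde{\boldsymbol{\zeta}}^{\mathrm{EB}}_{1(3)} \mid \boldsymbol{\zeta}_{1(3)}, \mathbf{X}_{1(3)}; \hat{\boldsymbol{\theta}}) = \begin{bmatrix} 0.42 & -0.24 & 0.18 \\ -0.24 & 0.42 & 0.18 \\ 0.36 & 0.36 & 0.73 \end{bmatrix} \boldsymbol{\zeta}_{1(3)}.
$$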

Here the true realization of each latent variable affects the mean predictions 
of all latent variables in the same cluster. For example, a large true level-3 
random intercept will lead, on average, to larger predictions for the level-2 
random intercepts. 
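
The numbers in this example can be reproduced directly from (7.3) and (7.4). The sketch below assumes level-2 random intercept variances of 1, a level-3 variance of 2 and a level-1 variance of 1; these inputs are an assumption in the sense that they are the values that reproduce the rounded weights quoted in the example. Variable names are hypothetical.

```python
import numpy as np

# Structure matrix: two level-2 units with two level-1 units each
Lambda = np.array([[1.0, 0.0, 1.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0]])
Psi = np.diag([1.0, 1.0, 2.0])   # assumed random effects covariance matrix
Theta = 1.0 * np.eye(4)          # level-1 variance of 1 on the diagonal

# Marginal residual covariance matrix and scoring weights as in (7.3)
Sigma = Lambda @ Psi @ Lambda.T + Theta
W = Psi @ Lambda.T @ np.linalg.inv(Sigma)

# Conditional expectation matrix, first line of (7.4)
cond_mean = W @ Lambda

# Second line of (7.4): I - (I + Psi Omega)^{-1}
Omega = Lambda.T @ np.linalg.inv(Theta) @ Lambda
cond_mean_alt = np.eye(3) - np.linalg.inv(np.eye(3) + Psi @ Omega)

print(np.round(W, 2))
print(np.round(cond_mean, 2))
```

The two expressions for the conditional expectation matrix agree, which is a useful check on the algebra in (7.4). None of the diagonal elements equals one, reflecting the conditional bias (shrinkage) of the predictor.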

It is important to note that latent scores generally cannot simply be plugged 
into nonlinear functions of latent variables to obtain predictions of the 
function. For instance, in a dichotomous random intercept logit model, the 
probability of a positive response takes the form $[1 + \exp(-\mathbf{x}_{ij}'\boldsymbol{\beta} - \zeta_j)]^{-1}$. To obtain 
the empirical Bayes predictor of the probability we must integrate the 
nonlinear function with respect to the posterior distribution of the latent variable 
instead of plugging in $\tilde{\zeta}_j^{\mathrm{EB}}$. 
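
The difference between the two approaches is easy to see numerically. The sketch below, for a random intercept logit model with a normal prior, computes both the posterior mean of the response probability and the plug-in value obtained by inserting the empirical Bayes prediction of the intercept. It uses ordinary Gauss-Hermite quadrature; the function name, arguments and data are hypothetical, and the parameters are treated as known.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


def eb_probability(y, eta_fixed, eta_new, psi, n_quad=30):
    """Return (EB intercept, EB probability, plug-in probability).

    y         : 0/1 responses of the cluster
    eta_fixed : fixed part x_ij' beta for the observed units
    eta_new   : fixed part for the unit whose probability is predicted
    psi       : random intercept variance, treated as known
    """
    y = np.asarray(y, dtype=float)
    eta_fixed = np.asarray(eta_fixed, dtype=float)
    nodes, weights = hermgauss(n_quad)
    z = np.sqrt(2.0 * psi) * nodes

    # Posterior weights at the quadrature locations
    p = expit(eta_fixed[None, :] + z[:, None])
    lik = np.prod(np.where(y[None, :] == 1.0, p, 1.0 - p), axis=1)
    post = weights * lik
    post = post / post.sum()

    zeta_eb = np.sum(post * z)                     # posterior mean of intercept
    prob_eb = np.sum(post * expit(eta_new + z))    # posterior mean of probability
    prob_plugin = expit(eta_new + zeta_eb)         # probability at the mean
    return zeta_eb, prob_eb, prob_plugin


# Hypothetical cluster with three positive responses and a large variance
zeta_eb, prob_eb, prob_plugin = eb_probability(
    y=[1, 1, 1], eta_fixed=[0.0, 0.0, 0.0], eta_new=0.0, psi=4.0
)
```

In this illustration the two probabilities differ noticeably, because the logistic function is curved over the range where the posterior has appreciable mass. The gap shrinks when the posterior is concentrated, for example in large clusters.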

7.3.2 Empirical Bayes variances and covariances 

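We consider four types of variances and covariances of latent scores (we will 
use the term ‘covariances’ to stand for both): 

• Posterior covariances: $\mathrm{Cov}(\boldsymbol{\zeta}_{(L)} \mid \mathbf{y}_{(L)}, \mathbf{X}_{(L)}; \hat{\boldsymbol{\theta}})$ 

• Marginal sampling covariances: $\mathrm{Cov}_y(\tilde{\boldsymbol{\zeta}}_{(L)} \mid \mathbf{X}_{(L)}; \hat{\boldsymbol{\theta}})$ 

• Conditional sampling covariances: $\mathrm{Cov}_y(\tilde{\boldsymbol{\zeta}}_{(L)} \mid \boldsymbol{\zeta}_{(L)}, \mathbf{X}_{(L)}; \hat{\boldsymbol{\theta}})$ 

• Prediction error covariances (marginal): $\mathrm{Cov}_y(\tilde{\boldsymbol{\zeta}}_{(L)} - \boldsymbol{\zeta}_{(L)} \mid \mathbf{X}_{(L)}; \hat{\boldsymbol{\theta}})$ 

These covariances are relevant to all scoring methods discussed in this 
chapter. We have again substituted estimates $\hat{\boldsymbol{\theta}}$ for the structural parameters $\boldsymbol{\theta}$. 
Note that the posterior covariance is not fully Bayesian since it is not marginal 
with respect to a random parameter vector $\boldsymbol{\theta}$ and the sampling covariances 
are not fully frequentist since the sampling variability of $\hat{\boldsymbol{\theta}}$ is ignored. 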

Posterior variances and covariances 





The posterior covariance matrix is the covariance matrix of the latent
variables over the posterior distribution, given the observed responses and
covariates. The empirical posterior covariance matrix, which we will rely on,
is the posterior covariance matrix with parameter estimates plugged in.
Confidence intervals based on the posterior mean and posterior standard
deviation are analogous to Bayesian credible intervals based on approximate
normality of the posterior.

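The empirical posterior covariance matrix is given by

$$
\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})
= \int (\zeta_{(L)} - \tilde{\zeta}^{EB}_{(L)})
       (\zeta_{(L)} - \tilde{\zeta}^{EB}_{(L)})'
  \, \omega(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}) \, d\zeta_{(L)},
$$

which can be obtained by numerical integration as shown for adaptive
quadrature in Section 6.3.2.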

The empirical posterior variances are biased downwards compared with the
fully Bayesian posterior variances since the structural parameters $\theta$
are treated as known. This can be seen by writing the fully Bayesian posterior
variance matrix as

$$
\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)})
= E_{\theta}[\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \theta)]
+ \mathrm{Cov}_{\theta}[E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \theta)], \qquad (7.6)
$$

where the first term is approximated by the empirical Bayesian posterior
covariance matrix. Importantly, the first term of (7.6) will dominate when the
number of clusters is large and the number of units per cluster is small (Kass
and Steffey, 1989). Kass and Steffey also suggest approximations for the second
term. Assuming uniform priors for all parameters, their first-order
approximation (in terms of the estimated fundamental parameters
$\hat{\vartheta}$) is simply

$$
\mathrm{Cov}_{\vartheta}[E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \vartheta)]
\approx
- \left[\frac{\partial E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \vartheta)}{\partial \vartheta}\right]'
  H^{-1}
  \left[\frac{\partial E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \vartheta)}{\partial \vartheta}\right],
$$

where $H$ is the Hessian of the log-likelihood at $\hat{\vartheta}$ and the
terms in brackets are partial derivatives of the empirical Bayes predictions
with respect to $\vartheta$, evaluated at the estimates $\hat{\vartheta}$. In
linear mixed models, this approximation has a closed form and is discussed by
Goldstein (2003, Appendix 2.2). Ten Have and Localio (1999) use numerical
integration to evaluate this approximation for mixed effects logistic
regression. See Section 9.5 for a comparison of fully Bayesian and empirical
Bayesian credible intervals for study-specific treatment effects in
meta-analysis.

The posterior standard deviation is often used by frequentists as a standard
error of the empirical Bayes prediction. This can be justified in the ‘linear
case’ where the posterior standard deviation equals the sampling standard
deviation of the prediction error (see page 234). However, the posterior
standard deviation is also commonly used in IRT (e.g. Embretson and Reise,
2000) and generalized linear mixed models (e.g. Ten Have and Localio, 1999),
apparently without any frequentist justification.

The ‘linear case’: In this case the empirical posterior covariance matrix
becomes

$$
\mathrm{Cov}(\zeta_{z(L)} \mid y_{z(L)}, X_{z(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} - \hat{\Psi}_{(L)} \hat{\Lambda}_{z(L)}'
  \hat{\Sigma}_{z(L)}^{-1} \hat{\Lambda}_{z(L)} \hat{\Psi}_{(L)}, \qquad (7.7)
$$

which for the special case of the random intercept model reduces to

$$
\mathrm{Var}(\zeta_{j(2)} \mid y_{j(2)}, X_{j(2)}; \hat{\theta})
= \frac{\hat{\theta}/n_j}{\hat{\psi} + \hat{\theta}/n_j}\, \hat{\psi}
= (1 - R_{j(2)})\, \hat{\psi}.
$$

As expected, the posterior variance is smaller than the prior variance due
to the information gained regarding the random intercept by knowing the
responses $y_{j(2)}$.

Example: Returning to the three-level example, the posterior covariance
matrix becomes

$$
\mathrm{Cov}(\zeta_{1(3)} \mid y_{1(3)}, X_{1(3)}; \hat{\theta}) =
\begin{bmatrix}
 0.58 &       &      \\
 0.24 &  0.58 &      \\
-0.36 & -0.36 & 0.55
\end{bmatrix},
$$

where there are two important things to note. First, even though the random
intercepts of the two level-2 units are uncorrelated under the prior
distribution, the posterior covariance is nonzero (equal to 0.24). Second,
even though there are no cross-level covariances under the prior, the level-2
random intercepts have nonzero covariances with the level-3 random intercept
(equal to -0.36).
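The posterior covariance matrix of the example follows directly from (7.7). The sketch below assumes the same design as the running example (two level-2 units with two level-1 units each) and the inferred values $\psi^{(2)} = 1$, $\psi^{(3)} = 2$, $\theta = 1$, which are not quoted in this passage. The helper `random_intercept_posterior_var` is a hypothetical name for the scalar special case.

```python
import numpy as np

psi2, psi3, theta = 1.0, 2.0, 1.0   # assumed (inferred) variances

Lam = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
Psi = np.diag([psi2, psi2, psi3])
Sigma = Lam @ Psi @ Lam.T + theta * np.eye(4)

# Equation (7.7): Psi - Psi Lam' Sigma^{-1} Lam Psi
post_cov = Psi - Psi @ Lam.T @ np.linalg.inv(Sigma) @ Lam @ Psi
print(np.round(post_cov, 2))

def random_intercept_posterior_var(psi, theta, n):
    """Scalar special case: (1 - R) * psi with R = psi / (psi + theta / n)."""
    R = psi / (psi + theta / n)
    return (1.0 - R) * psi

print(random_intercept_posterior_var(1.0, 1.0, 4))
```

The off-diagonal elements arise only through conditioning on the responses; `Psi` itself is diagonal.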

Marginal sampling variances and covariances 

The marginal sampling covariances are the covariances of the predictions
under repeated sampling of clusters and units within clusters, keeping both
the covariates and the parameter estimates fixed.

In contrast to the conditional sampling covariances discussed in the next
section, the marginal sampling covariances also reflect variability due to
sampling of the latent variables from their prior distribution. The marginal
sampling standard deviation can therefore be used for detecting clusters that
appear inconsistent with the model. For this reason, Goldstein (2003) refers
to this quantity as the ‘diagnostic standard error’. See Section 8.6.2 for
further discussion.

The marginal sampling covariance matrix of the empirical Bayes predictor
is

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid X_{(L)}; \hat{\theta})
= \mathrm{Cov}_y[E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})]
= \int \tilde{\zeta}^{EB}_{(L)} \tilde{\zeta}^{EB\prime}_{(L)}
  \, g^{(L)}(y_{(L)} \mid X_{(L)}; \hat{\theta}) \, dy_{(L)},
$$

where $g^{(L)}(y_{(L)} \mid X_{(L)}; \hat{\theta})$ is the joint marginal
distribution of the responses for the top-level cluster. Note that we have
dropped the $z$ subscript to simplify notation and will continue to do so in
the remainder of this chapter.
There is unfortunately no closed form expression for the general model.
However, Skrondal (1996) suggested using the relation

$$
\mathrm{Cov}(\zeta_{(L)} \mid X_{(L)}; \hat{\theta})
= E_y[\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})]
+ \mathrm{Cov}_y[E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})],
$$

to derive an approximate expression. Recognizing that the last term is the
required covariance matrix of the empirical Bayes predictor, this can be
rewritten as

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid X_{(L)}; \hat{\theta})
= \mathrm{Cov}(\zeta_{(L)} \mid X_{(L)}; \hat{\theta})
- E_y[\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})].
$$

The first term on the right-hand side is the estimated prior covariance matrix
$\hat{\Psi}_{(L)}$ of the vector of latent variables $\zeta_{(L)}$ and the
second term is the expectation of the posterior covariance matrix, which can
be approximated by the posterior covariance matrix. Therefore we obtain

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid X_{(L)}; \hat{\theta})
\approx \hat{\Psi}_{(L)}
- \mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}). \qquad (7.8)
$$

An alternative approximation would be via simulation, first sampling the
latent variables from the prior distribution and then the responses from their
conditional distribution given the latent variables and covariates. An
advantage of this approach is that uncertainty in the parameter estimates
$\hat{\theta}$ is easily accommodated by drawing new samples from their
sampling distribution before sampling the latent variables, as suggested by
Longford (2001) in a related context.
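The simulation approach can be sketched for the three-level running example, where it can be checked against (7.8). The design and the variances $\psi^{(2)} = 1$, $\psi^{(3)} = 2$, $\theta = 1$ are assumptions inferred from the chapter's numerical example, the seed and number of replicates are arbitrary, and parameter uncertainty is ignored (the parameters are held fixed).

```python
import numpy as np

rng = np.random.default_rng(12345)
psi2, psi3, theta = 1.0, 2.0, 1.0   # assumed (inferred) variances

Lam = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
Psi = np.diag([psi2, psi2, psi3])
Sigma = Lam @ Psi @ Lam.T + theta * np.eye(4)
W = Psi @ Lam.T @ np.linalg.inv(Sigma)       # linear-case empirical Bayes weights

n_rep = 200_000
# Step 1: draw the latent variables from their prior.
zeta = rng.normal(size=(n_rep, 3)) * np.sqrt(np.diag(Psi))
# Step 2: draw the residuals y - X beta given the latent variables.
resid = zeta @ Lam.T + rng.normal(scale=np.sqrt(theta), size=(n_rep, 4))
# Step 3: compute the empirical Bayes prediction for every replicate.
pred = resid @ W.T

sim_cov = np.cov(pred, rowvar=False)

# Approximation (7.8): prior covariance minus posterior covariance.
post_cov = Psi - Psi @ Lam.T @ np.linalg.inv(Sigma) @ Lam @ Psi
approx = Psi - post_cov

print(np.round(sim_cov, 2))
print(np.round(approx, 2))
```

For non-normal responses the same three steps apply, with the prediction in step 3 obtained by numerical integration instead of a weight matrix.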

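The ‘linear case’: In this case, (7.8) holds perfectly and the marginal
sampling variance becomes

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} - \mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} \hat{\Lambda}_{(L)}' \hat{\Sigma}_{(L)}^{-1}
  \hat{\Lambda}_{(L)} \hat{\Psi}_{(L)}. \qquad (7.9)
$$

The diagonal elements of this covariance matrix are clearly smaller than those
of the prior covariance matrix since the posterior variances are positive. This
is due to shrinkage.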

Shrinkage has led some researchers (e.g. Louis, 1984) to suggest adjusted
empirical Bayes predictors with the same covariances as the prior
distribution. This predictor minimizes the posterior expectation of the summed
quadratic loss function (for given parameter estimates) in (7.2) subject to the
side condition that the predictions satisfy the estimated first and second
order moments of the prior distribution. In the factor analysis literature, the
idea of obtaining factor scores with the same covariance matrix as the prior
distribution dates back to Anderson and Rubin (1956), who considered models
with orthonormal factors. This ‘covariance preserving’ approach has been
extended to general prior covariance matrices (e.g. Ten Berge, 1983; Ten Berge
et al., 1999).

For a simple random intercept model, the marginal sampling variance reduces to

$$
\mathrm{Var}_y(\tilde{\zeta}^{EB}_{j(2)} \mid X_{j(2)}; \hat{\theta})
= \frac{\hat{\psi}}{\hat{\psi} + \hat{\theta}/n_j}\, \hat{\psi}
= R_{j(2)}\, \hat{\psi}.
$$

Example: Returning to the example of a three-level random intercept model,
the marginal sampling covariance matrix of the random effects of the level-3
unit becomes

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{1(3)} \mid X_{1(3)}; \hat{\theta}) =
\begin{bmatrix}
 0.42 &      &      \\
-0.24 & 0.42 &      \\
 0.36 & 0.36 & 1.45
\end{bmatrix}.
$$

Note that there are nonzero covariances between the level-2 random intercepts
for different level-2 units as for the posterior covariance matrix. Also
observe that there are again nonzero covariances across levels.

Nonzero sampling covariances across levels complicate diagnostics in
multilevel models considerably. Unfortunately, this problem has often been
overlooked in the multilevel literature, for example by Langford and Lewis
(1998) and Goldstein (2003).

Conditional sampling variances and covariances 

The conditional sampling covariances are the covariances of the predictions
under repeated sampling of units from the same cluster with fixed ‘true’ latent
variables $\zeta_{(L)}$ (in addition to fixed covariates and parameter
estimates).

Note that the conditional sampling standard deviation should not be confused
with the ‘comparative standard error’ used by Goldstein (2003, p.23).
Goldstein describes this as being conditional on the true latent variables but
it is actually the marginal (over $\zeta_{(L)}$) sampling variance of the
prediction errors, which we will show to be equal to the posterior variance in
the ‘linear case’ on page 235.

The conditional sampling covariance matrix of the empirical Bayes predictors
$\tilde{\zeta}^{EB}_{(L)}$, given the latent variables, is

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})
= \mathrm{Cov}_y[E(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}) \mid \zeta_{(L)}]
= \int \tilde{\zeta}^{EB}_{(L)} \tilde{\zeta}^{EB\prime}_{(L)}
  \prod g^{(1)}(y_{ij \ldots z} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta}) \, dy_{(L)},
$$

where the product represents the joint conditional distribution of the
responses to all level-1 units within the level-$L$ unit.

The ‘linear case’: Here, the conditional sampling covariance matrix becomes

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} \hat{\Lambda}_{(L)}' \hat{\Sigma}_{(L)}^{-1}
  \hat{\Theta}_{(L)} \hat{\Sigma}_{(L)}^{-1} \hat{\Lambda}_{(L)} \hat{\Psi}_{(L)}, \qquad (7.10)
$$

which for the special case of the random intercept model can be expressed as

$$
\mathrm{Var}_y(\tilde{\zeta}^{EB}_{j(2)} \mid \zeta_{j(2)}, X_{j(2)}; \hat{\theta})
= R_{j(2)} (1 - R_{j(2)})\, \hat{\psi}.
$$
Example: In the three-level random intercept model, the conditional sampling
covariance matrix for the random effects of the level-3 unit becomes

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{1(3)} \mid \zeta_{1(3)}, X_{1(3)}; \hat{\theta}) =
\begin{bmatrix}
 0.12 &      &      \\
-0.10 & 0.12 &      \\
 0.03 & 0.03 & 0.13
\end{bmatrix}.
$$

As expected, the variances are considerably lower than the marginal sampling
variances. Note also the difference between the conditional sampling
covariance matrix and the corresponding posterior covariance matrix in this
example as these variances are often confused.

Prediction error variances and covariances

The prediction error covariances are the covariances of the prediction errors
$\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)}$ under repeated sampling of the
responses from their marginal distribution,

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)} \mid X_{(L)}; \hat{\theta})
= \int (\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})
       (\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})'
  \, g^{(L)}(y_{(L)} \mid X_{(L)}; \hat{\theta}) \, dy_{(L)}.
$$

The covariances are marginal since they reflect variability due to sampling
of the latent variables as well as the sampling of responses given the latent
variables. The standard deviation of the prediction errors is perhaps the most
obvious measure of prediction uncertainty.

Using a multivariate generalization of a derivation in Waclawiw and Liang
(1994), the prediction error covariance matrix can be expressed as (omitting
the conditioning on $X_{(L)}$ and $\hat{\theta}$ for brevity)

$$
\begin{aligned}
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})
&= E_y[(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})']
 - E_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})\, E_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)})' \\
&= E_y[(E(\zeta_{(L)} \mid y_{(L)}) - \zeta_{(L)})(E(\zeta_{(L)} \mid y_{(L)}) - \zeta_{(L)})'] \\
&= E_y\{E[(E(\zeta_{(L)} \mid y_{(L)}) - \zeta_{(L)})(E(\zeta_{(L)} \mid y_{(L)}) - \zeta_{(L)})' \mid y_{(L)}]\} \\
&= E_y[\mathrm{Cov}(\zeta_{(L)} \mid y_{(L)})],
\end{aligned}
$$

the expectation of the posterior covariance matrix over the marginal sampling
distribution. In the above derivation, the second equality relies on
unconditional unbiasedness of the empirical Bayes predictor, the third
equality exploits the double expectation rule and the final equality simply
recognizes that the term in square brackets represents the posterior
covariance matrix.
The above expression suggests the following approximation:

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)} \mid X_{(L)}; \hat{\theta})
\approx \mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}). \qquad (7.11)
$$

An alternative approximation would be to use simulation as for the marginal
sampling covariances: Draw latent variables from their prior and subsequently
responses from their conditional distribution given the latent variables. The
‘true’ latent variable realizations are then just the simulated ones and,
subtracting these from the empirical Bayes predictions, we can estimate the
prediction error variances. To reflect the imprecision of the parameter
estimates, the above could be preceded by sampling the parameters from their
estimated sampling distribution.

The ‘linear case’: The posterior covariance matrix for the ‘linear case’ in
(7.7) does not involve the responses so that the approximation in (7.11)
becomes exact in this case,

$$
\mathrm{Cov}_y(\tilde{\zeta}^{EB}_{(L)} - \zeta_{(L)} \mid X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} - \hat{\Psi}_{(L)} \hat{\Lambda}_{(L)}'
  \hat{\Sigma}_{(L)}^{-1} \hat{\Lambda}_{(L)} \hat{\Psi}_{(L)}
= \mathrm{Cov}(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}),
$$

i.e., the prediction error covariances are just the posterior covariances given
in equation (7.7) on page 231.
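Both linear-case results can be verified algebraically for the three-level running example. The sketch assumes the design of that example and the inferred variances $\psi^{(2)} = 1$, $\psi^{(3)} = 2$, $\theta = 1$ (not quoted in this passage). It writes the prediction error as $(W\Lambda - I)\zeta + W\epsilon$, where `W` holds the empirical Bayes weights and $\epsilon$ denotes the level-1 residuals.

```python
import numpy as np

psi2, psi3, theta = 1.0, 2.0, 1.0   # assumed (inferred) variances

Lam = np.array([[1.0, 0.0, 1.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
Psi = np.diag([psi2, psi2, psi3])
Theta = theta * np.eye(4)
Sigma = Lam @ Psi @ Lam.T + Theta
W = Psi @ Lam.T @ np.linalg.inv(Sigma)

# (7.10): covariance of the predictions for fixed latent variables.
cond_cov = W @ Theta @ W.T

# (7.7): posterior covariance.
post_cov = Psi - W @ Lam @ Psi

# Prediction error = (W Lam - I) zeta + W eps, with zeta and eps independent.
B = W @ Lam - np.eye(3)
pred_err_cov = B @ Psi @ B.T + cond_cov

print(np.round(cond_cov, 2))
print(np.round(pred_err_cov, 2))
```

The conditional sampling variances (about 0.12 to 0.13) are far smaller than the posterior variances (about 0.55 to 0.58), which illustrates why the two should not be confused.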


7.4 Empirical Bayes modal (EBM) 

7.4.1 Empirical Bayes modal prediction

Instead of using the posterior mean as in empirical Bayes prediction, we could
use the posterior mode. It can be shown that the posterior mode minimizes
the posterior expectation of the zero-one loss function:

$$
L^{EBM}(\tilde{\zeta}_{(L)}, \zeta_{(L)}) =
\begin{cases}
0 & \text{if } |\tilde{\zeta}_{(L)} - \zeta_{(L)}| \le \epsilon \\
1 & \text{if } |\tilde{\zeta}_{(L)} - \zeta_{(L)}| > \epsilon,
\end{cases}
$$

where $\epsilon$ is a vector of minute numbers such that
$L^{EBM}(\tilde{\zeta}_{(L)}, \zeta_{(L)})$ is zero when
$\tilde{\zeta}_{(L)}$ is in the close vicinity of $\zeta_{(L)}$ and one
otherwise. Unlike the mean squared error loss function underlying empirical
Bayes, the above loss function is also meaningful for latent classes since the
loss will be 1 whenever the predicted latent class is not the true latent class
and zero otherwise. The loss function is in this case simply the number of
misclassifications.

Plugging in estimates $\hat{\theta}$ for $\theta$, we obtain

$$
\tilde{\zeta}^{EBM}_{(L)}
= \arg\max_{\zeta_{(L)}} \, \omega(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}).
$$

We suggest denoting this predictor the empirical Bayes modal (EBM). Some
authors use the terms Bayes modal (e.g. Samejima, 1969) or ‘modal a posteriori
(MAP) estimators’ (e.g. Bock and Aitkin, 1981; Bock and Mislevy, 1982;
Bock, 1983, 1985), which do not explicitly acknowledge that estimates have
been plugged in.

Empirical Bayes modal is the standard classification method in latent class
modeling. An alternative loss function assigns different weights to different
misclassifications, reflecting the different costs incurred. The Bayes risk
criterion then uses the classification that minimizes the expected cost.
Proportional prediction, on the other hand, randomly assigns classes to
clusters according to their posterior probabilities. Clogg et al. (1991)
suggest using this method for imputing latent classes in multiple imputation.

Generally, there is no analytical expression for the empirical Bayes modal
predictor and we must resort to numerical methods. The posterior mode is
the solution of

$$
\frac{\partial}{\partial \zeta_{(L)}}
\ln \omega(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta}) = 0,
$$

(provided second order conditions are fulfilled) which can be obtained using
gradient methods (see Section 6.4.2), for instance the Newton-Raphson
algorithm.

Since the denominator of the posterior distribution does not depend on
$\zeta_{(L)}$, as seen in equation (7.1), we can use the numerator in place of
$\omega$ in the above expressions,

$$
\frac{\partial}{\partial \zeta_{(L)}} \ln h(\zeta_{(L)}; \hat{\theta})
+ \frac{\partial}{\partial \zeta_{(L)}}
  \ln \prod g^{(1)}(y_{ij \ldots z} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta}) = 0. \qquad (7.12)
$$

In contrast to empirical Bayes, this method does not require numerical
integration. For this reason, empirical Bayes modal is often used as an
approximation to empirical Bayes when the posterior density is approximately
multivariate normal. As pointed out in the previous chapter, Lindley and Smith
(1972) suggest using Bayes modal as an approximation to the Bayes predictor in
the truly Bayesian setting. Using this method corresponds to maximizing the
h-likelihood with respect to the latent variables for given parameter values
(see page 164).

Since integration is avoided, empirical Bayes modal is more computationally
efficient than empirical Bayes for latent variable models with non-normal
responses. Samejima (1969) therefore introduced empirical Bayes modal in the
context of her ‘graded response model’, an ordinal probit or logit one-factor
model with normally distributed factor, and provided a rigorous derivation
of its properties. Interestingly, Samejima also reported the somewhat
surprising result that EB and EBM predictions were virtually indistinguishable
for a model with just six dichotomous responses. Muthén (1977) subsequently
extended Samejima’s approach to probit multiple-factor models with
dichotomous responses. Generalizing the results of Samejima and Muthén,
Skrondal (1996, Chapter 7) discussed empirical Bayes modal prediction for a
general class of multidimensional latent variable models with multivariate
normal latent responses.
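The Newton-Raphson solution of (7.12) can be sketched for a dichotomous random intercept logit model. The setting is an assumption made for illustration: one cluster, a $N(0, \psi)$ prior for the random intercept, and a known linear predictor `eta` that excludes the latent variable. The function names are hypothetical.

```python
import numpy as np

def log_posterior_derivs(zeta, y, eta, psi):
    """Gradient and Hessian of ln h(zeta) + ln prod g(y | zeta) with respect to zeta."""
    p = 1.0 / (1.0 + np.exp(-(eta + zeta)))
    grad = -zeta / psi + np.sum(y - p)          # prior part + likelihood part
    hess = -1.0 / psi - np.sum(p * (1.0 - p))   # always negative: log posterior is concave
    return grad, hess

def ebm_newton(y, eta, psi, tol=1e-10, max_iter=100):
    zeta = 0.0                                  # start at the prior mode
    for _ in range(max_iter):
        grad, hess = log_posterior_derivs(zeta, y, eta, psi)
        step = grad / hess
        zeta -= step                            # Newton-Raphson update
        if abs(step) < tol:
            break
    return zeta

y = np.array([1, 1, 0, 1, 1])
eta = np.zeros(5)
mode = ebm_newton(y, eta, psi=1.0)
grad_at_mode, hess_at_mode = log_posterior_derivs(mode, y, eta, psi=1.0)
print(mode, grad_at_mode)
```

No integral is evaluated at any point; only the two derivatives of the log numerator of the posterior are needed. With four positive responses out of five, the mode lies between the prior mode 0 and the maximum likelihood value $\ln 4$.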
The ‘linear case’: Here, the posterior is multivariate normal so that the
expectation equals the mode. Hence, the empirical Bayes and empirical Bayes
modal predictors coincide in this case, see equation (7.3) on page 227.


7.4.2 Empirical Bayes modal covariances and classification error

Empirical Bayes modal variances and covariances

For general latent variable models we are not aware of methods for deriving
the different types of sampling (co)variances for empirical Bayes modal, apart
from simulation. However, for large clusters where the posterior approaches
normality and the posterior mean is close to the mode, the empirical Bayes
sampling covariances should be good approximations. In the ‘linear case’ the
empirical Bayes covariances equal the empirical Bayes modal covariances; see
equation (7.7) on page 231 for the prediction error covariances, equation (7.9)
on page 232 for the marginal sampling covariances, and equation (7.10) on
page 233 for the conditional sampling covariances.

Having used gradient methods to find the posterior mode, it is natural to
use the negative inverse of the Hessian at the mode as an approximation to
the posterior covariance matrix,

$$
\mathrm{Cov}(\zeta_{z(L)} \mid y_{z(L)}, X_{z(L)}; \hat{\theta})
\approx - \left[\frac{\partial^2}{\partial \zeta_{(L)} \partial \zeta_{(L)}'}
  \ln \omega(\zeta_{(L)} \mid y_{(L)}, X_{(L)}; \hat{\theta})\right]^{-1}.
$$

The approximation becomes exact as the posterior approaches a multivariate
normal, i.e. as the cluster size increases.
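The quality of the curvature-based approximation can be examined numerically. The sketch below assumes a random intercept logit model with linear predictor zero, a $N(0, 4)$ prior, and two illustrative clusters with the same proportion of positive responses (3 of 4 and 30 of 40). The reference variance is computed on a fine grid, which is a simple stand-in for quadrature.

```python
import numpy as np

def log_post(zeta, n_ones, n, psi):
    # Unnormalized log posterior: log prior + log likelihood of n logit responses.
    return -zeta**2 / (2.0 * psi) + n_ones * zeta - n * np.logaddexp(0.0, zeta)

def curvature_vs_grid(n_ones, n, psi):
    # Reference posterior variance from a fine, uniform grid.
    grid = np.linspace(-15.0, 15.0, 30001)
    lp = log_post(grid, n_ones, n, psi)
    w = np.exp(lp - lp.max())
    w /= w.sum()
    mean = np.sum(w * grid)
    var_grid = np.sum(w * (grid - mean) ** 2)

    # Posterior mode by Newton-Raphson, then negative inverse Hessian at the mode.
    zeta = 0.0
    for _ in range(100):
        p = 1.0 / (1.0 + np.exp(-zeta))
        grad = -zeta / psi + n_ones - n * p
        hess = -1.0 / psi - n * p * (1.0 - p)
        zeta -= grad / hess
    p = 1.0 / (1.0 + np.exp(-zeta))
    var_curv = 1.0 / (1.0 / psi + n * p * (1.0 - p))
    return var_curv, var_grid

small = curvature_vs_grid(n_ones=3, n=4, psi=4.0)
large = curvature_vs_grid(n_ones=30, n=40, psi=4.0)
rel_small = abs(small[0] - small[1]) / small[1]
rel_large = abs(large[0] - large[1]) / large[1]
print(rel_small, rel_large)
```

The relative error is noticeably smaller for the larger cluster, in line with the posterior approaching normality as the cluster size increases.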

Classification error 

For discrete latent variables, the quality of the classification of a given
cluster can be assessed using the estimated conditional probability of
misclassification given $y_j$ and $X_j$,

$$
f_j = 1 - \omega(e_c \mid y_j, X_j; \hat{\theta}). \qquad (7.13)
$$

The overall misclassification rate can be estimated by the sample mean of
$f_j$ over clusters. This is the basis of the proportional reduction of
classification error criterion which compares this misclassification rate with
the rate when $y_j$ and $X_j$ are not available (see Sections 9.3 and 13.5 for
applications).
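Modal classification and the misclassification probability in (7.13) can be sketched for a two-class model with three dichotomous items. The class probabilities, the conditional response probabilities and the three response patterns are illustrative assumptions, and covariates are omitted.

```python
import numpy as np

def posterior_class_probs(y, class_probs, item_probs):
    """Posterior class probabilities for one response pattern.

    y: (n_items,) array of 0/1 responses
    class_probs: (C,) prior class probabilities
    item_probs: (C, n_items) probabilities P(y_i = 1 | class c)
    """
    lik = np.prod(np.where(y == 1, item_probs, 1.0 - item_probs), axis=1)
    joint = class_probs * lik
    return joint / joint.sum()

class_probs = np.array([0.5, 0.5])
item_probs = np.array([[0.8, 0.8, 0.8],
                       [0.2, 0.2, 0.2]])
patterns = np.array([[1, 1, 1],
                     [1, 0, 0],
                     [0, 0, 0]])

post = np.array([posterior_class_probs(y, class_probs, item_probs) for y in patterns])
modal_class = post.argmax(axis=1)   # empirical Bayes modal classification
f = 1.0 - post.max(axis=1)          # estimated misclassification probability per pattern
overall_rate = f.mean()             # mean over these three illustrative clusters

print(modal_class, np.round(f, 3), round(overall_rate, 3))
```

Without the responses, the best classification would pick the larger prior class probability, giving an error rate of 0.5 here; the comparison of that rate with `overall_rate` is the idea behind the proportional reduction criterion.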

7.5 Maximum likelihood 

7.5.1 Latent score estimation 

Latent variables are sometimes taken to be nonrandom or fixed. In this
situation it is natural to interpret the latent scores as unknown parameters to
be estimated. In Section 6.10.1 we considered joint estimation of model
parameters and latent variables. In contrast, we now assume that the model
parameters have been estimated (using one of the estimation methods discussed
in the previous chapter) and consider the estimation of the latent variables
$\zeta$ for given $\hat{\theta}$.

The estimation approach to scoring or classification is based on the
conditional distribution of the responses, given the latent variables, with the
estimates $\hat{\theta}$ of the model parameters plugged in,

$$
\prod g^{(1)}(y_{ij \ldots z} \mid \zeta_{z(L)}, X_{z(L)}; \hat{\theta}).
$$

This conditional distribution is interpreted as a ‘likelihood’ with the values
of the latent variables for the cluster as unknown parameters.

Analogously to maximum likelihood estimation of model parameters, the
conditional distribution is maximized with respect to the unknown latent
variables (parameters) by solving the likelihood equations

$$
\frac{\partial}{\partial \zeta_{z(L)}}
\ln \prod g^{(1)}(y_{ij \ldots z} \mid \zeta_{z(L)}, X_{z(L)}; \hat{\theta}) = 0.
$$

It is interesting to note that this corresponds to maximizing the second term
of (7.12) for empirical Bayes modal. In empirical Bayes modal the log prior
density may hence be regarded as a penalty term for deviations from the prior
mode. The empirical Bayes modal predictions are therefore shrunken towards
the prior mode relative to the maximum likelihood estimates.

As would be expected, the estimates for a cluster are asymptotically unbiased
as the number of units in the cluster tends to infinity. However, this result
may not be useful since the number of units in the cluster is often small, for
instance in longitudinal or family studies. For the Rasch model, Hoijtink and
Boomsma (1995) give an overview of the literature on the finite sample
properties of maximum likelihood, empirical Bayes modal and the weighted
maximum likelihood estimator suggested by Warm (1989).

Special problems arise for clusters with sparse information because the prior
distribution of the latent variables is not utilized. For example, consider a
unidimensional factor model with dichotomous responses. If all the responses
for a given cluster (typically subject) are zero, the likelihood contribution
for that cluster does not have a maximum and the factor score would have to be
$-\infty$ to satisfy the likelihood equation (Samejima, 1969). Another example
is a growth curve model with a random intercept and slope where the slope for
a cluster cannot be estimated if only one response is observed for the
cluster. Neither example would pose any problems for empirical Bayes
prediction, which utilizes the prior distribution. The first example benefits
from shrinkage, pulling the prediction away from $-\infty$, whereas the second
benefits from the information in the random intercept through its posterior
covariance with the random slope.
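The first of these problems is easy to reproduce. The sketch assumes five dichotomous logit responses that are all zero, a linear predictor of zero, and, for the empirical Bayes modal comparison, a $N(0, 1)$ prior; all of these values are illustrative.

```python
import numpy as np

y = np.zeros(5)      # every response in the cluster is zero
eta = np.zeros(5)    # assumed linear predictor excluding the latent variable

def loglik(zeta):
    # ln prod g(y | zeta) for dichotomous logit responses
    lin = eta + zeta
    return np.sum(y * lin - np.logaddexp(0.0, lin))

# The 'likelihood' keeps increasing as zeta decreases, so it has no maximum.
values = [loglik(z) for z in (0.0, -5.0, -10.0, -20.0)]

# Empirical Bayes modal: the log prior penalizes large |zeta| and yields a finite mode.
psi = 1.0
zeta = 0.0
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(eta + zeta)))
    grad = -zeta / psi + np.sum(y - p)
    hess = -1.0 / psi - np.sum(p * (1.0 - p))
    zeta -= grad / hess
ebm = zeta
final_grad = -ebm / psi + np.sum(y - 1.0 / (1.0 + np.exp(-(eta + ebm))))

print(values)
print(ebm)
```

The log-likelihood approaches its supremum of zero only as the score tends to minus infinity, whereas the penalized criterion has an interior maximum.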

Furthermore, maximum likelihood estimation of the latent variables requires
that the latent variables are considered fixed parameters. This is inconsistent
with our model framework and the marginal maximum likelihood method
of parameter estimation. We would in general not recommend the maximum
likelihood scoring method for these reasons. However, the maximum likelihood
method may be useful for assessing the normality assumption for the latent
variables (see Section 8.6.2).

The ‘linear case’: Solving the likelihood equations for $\zeta_{(L)}$ gives

$$
\hat{\zeta}^{ML}_{(L)}
= \left(\hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1} \hat{\Lambda}_{(L)}\right)^{-1}
  \hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1}
  \left(y_{(L)} - X_{(L)} \hat{\beta}\right). \qquad (7.14)
$$

It follows that

$$
E_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta}) = \zeta_{(L)},
$$

so unlike the empirical Bayes and Bayes modal predictors, the maximum
likelihood estimator is conditionally unbiased, given the values of the latent
variables $\zeta_{(L)}$.

For common factor models (see Sections 3.3.2 and 3.3.3) maximum likelihood
corresponds to the Bartlett factor scoring method (Bartlett, 1937; 1938).
That this method can be interpreted as a maximum likelihood estimator for
the ‘linear case’ is not transparent in the conventional treatments, where the
Bartlett method is derived as either the minimizer of the sum of squares of
the standardized residuals (Anderson and Rubin, 1956; Lawley and Maxwell,
1971) or as the minimizer of summed quadratic loss among conditionally
unbiased estimators (Lawley and Maxwell, 1971). On the other hand, these
derivations demonstrate that the Bartlett method can be motivated without
making distributional assumptions.

In random effects modeling, this estimator is also known as the ordinary
least squares (OLS) estimator of the random effects. In the special case of a
two-level random intercept model, the estimator is just the mean raw residual

$$
\hat{\zeta}^{ML}_{j(2)} = \frac{1}{n_j} \sum_{i=1}^{n_j} (y_{ij} - x_{ij}'\hat{\beta}).
$$

Note that multilevel models with fixed intercepts at several levels are not
identified unless constraints are imposed, for instance that the level-two
intercepts within the same level-three cluster sum to zero. This implies
that maximum likelihood estimates do not exist for the running example in
this chapter since the matrix
$\hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1} \hat{\Lambda}_{(L)}$, which must
be inverted in equation (7.14), is singular.
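Both statements can be checked numerically. The sketch uses randomly generated stand-in residuals and an arbitrary level-1 variance for a cluster of six units (assumptions for illustration), and the three-intercept design of the chapter's running example with an identity level-1 covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5                      # assumed level-1 variance
resid = rng.normal(size=6)       # stand-in for the raw residuals y - X beta of one cluster

# Two-level random intercept model: Lambda is a column of ones, Theta = theta * I.
Lam = np.ones((6, 1))
Theta_inv = np.eye(6) / theta
zeta_ml = np.linalg.solve(Lam.T @ Theta_inv @ Lam, Lam.T @ Theta_inv @ resid)  # (7.14)
print(zeta_ml[0], resid.mean())

# Three-level running example: the three intercept columns are linearly dependent
# (column 1 + column 2 = column 3), so Lambda' Theta^{-1} Lambda cannot be inverted.
Lam3 = np.array([[1.0, 0.0, 1.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [0.0, 1.0, 1.0]])
info3 = Lam3.T @ Lam3            # Lambda' Theta^{-1} Lambda with Theta = I
rank3 = np.linalg.matrix_rank(info3)
print(rank3)
```

A rank of 2 for a 3 x 3 matrix confirms the singularity; imposing a sum-to-zero constraint on the level-two intercepts would remove one column and restore full rank.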


7.5.2 Variances and covariances of ML estimator 


Using likelihood theory, the asymptotic covariance matrix of the ML estimator
$\hat{\zeta}^{ML}_{(L)}$ becomes

$$
\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})
\approx - \left[\frac{\partial^2}{\partial \zeta_{(L)} \partial \zeta_{(L)}'}
  \ln \prod g^{(1)}(y_{ij \ldots z} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})\right]^{-1}.
$$

Interestingly,
$\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})$
can be interpreted as the conditional sampling covariance matrix of the scores,
given the true latent variable $\zeta_{z(L)}$. The asymptotics require that the
number of units (or items) in a cluster tends to infinity, and not the number
of clusters. Thus, the utility of this result may be questionable in practice,
where there are often few units in a cluster.

For a given cluster, the inverse of the above covariance matrix, the observed
information, is the sum of contributions
$-\partial^2 \ln g^{(1)}(y_{ij \ldots z} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})
/ \partial \zeta_{(L)} \partial \zeta_{(L)}'$
from the individual units. In unidimensional IRT ($\zeta_{(L)} \equiv \zeta_j$),
these contributions from individual items, plotted against $\zeta_j$, are known
as ‘item information curves’, and the sum of the contributions over the items
is called the ‘test information’ (e.g. Birnbaum, 1968). Item information
functions can be used to assess how much ‘information’ is gained about an
individual’s ability from knowing her response to an item as a function of her
true ability. Information functions play a central role in IRT and are helpful
in test construction, item selection, assessment of precision of measurement,
comparison of tests, comparison of scoring methods, and tailored or adaptive
testing (e.g. Hambleton and Swaminathan, 1985). For instance, in adaptive
testing the information function can be useful for choosing the next item to
present to an examinee given the current estimate of his ability. The item is
noninformative if it is too simple or too difficult for the examinee, making
the response too predictable.
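Item and test information can be sketched for logit items. The parameterization $p = [1 + \exp\{-a(\zeta - b)\}]^{-1}$ with discrimination `a` and difficulty `b` is an assumption made for illustration (it is not taken from this passage); for such an item the contribution $-\partial^2 \ln g / \partial \zeta^2$ equals $a^2 p (1 - p)$. The item parameters and the current ability estimate are hypothetical.

```python
import numpy as np

def item_information(zeta, a, b):
    """Information a^2 * p * (1 - p) of logit items at ability zeta."""
    p = 1.0 / (1.0 + np.exp(-a * (zeta - b)))
    return a**2 * p * (1.0 - p)

a = np.array([1.0, 1.0, 1.0])     # assumed discriminations
b = np.array([-1.0, 0.0, 2.0])    # assumed difficulties

zeta_grid = np.linspace(-4.0, 4.0, 81)
item_info = item_information(zeta_grid[:, None], a, b)   # one column per item
test_info = item_info.sum(axis=1)                         # test information curve

# Adaptive testing: present the item that is most informative at the current estimate.
current_estimate = 0.1
next_item = int(np.argmax(item_information(current_estimate, a, b)))
print(next_item)
```

Each curve peaks at the item's difficulty, so an item far from the examinee's current estimate contributes little, which matches the remark about items that are too simple or too difficult.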

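Observe that the covariance matrix of the maximum likelihood estimator
tends to the posterior covariance matrix as the cluster size tends to infinity -
the likelihood swamping the prior (e.g. DeGroot, 1970).

The marginal sampling covariance matrix is

$$
\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid X_{(L)}; \hat{\theta})
= \mathrm{Cov}_{\zeta}[E_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})]
+ E_{\zeta}[\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})].
$$

The ‘linear case’: Here, the marginal sampling covariance matrix of the
maximum likelihood estimator becomes

$$
\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)}
+ \left(\hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1} \hat{\Lambda}_{(L)}\right)^{-1}, \qquad (7.15)
$$

and the conditional sampling covariance matrix is

$$
\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid \zeta_{(L)}, X_{(L)}; \hat{\theta})
= \left(\hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1} \hat{\Lambda}_{(L)}\right)^{-1}
= \mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} - \zeta_{(L)} \mid X_{(L)}; \hat{\theta}),
$$

the same as the unconditional prediction error variance.

For a linear random intercept model, the marginal sampling variance is

$$
\mathrm{Var}_y(\hat{\zeta}^{ML}_{j(2)} \mid X_{j(2)}; \hat{\theta})
= \hat{\psi} + \hat{\theta}/n_j,
$$

and the conditional variance and prediction error variance are simply

$$
\mathrm{Var}_y(\hat{\zeta}^{ML}_{j(2)} \mid \zeta_{j(2)}, X_{j(2)}; \hat{\theta})
= \mathrm{Var}_y(\hat{\zeta}^{ML}_{j(2)} - \zeta_{j(2)} \mid X_{j(2)}; \hat{\theta})
= \hat{\theta}/n_j.
$$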
7.6 Relating the scoring methods in the ‘linear case’


We are now in a position to present a very instructive expression which relates
the empirical Bayes, empirical Bayes modal and maximum likelihood methods
in the ‘linear case’.

The marginal sampling covariance matrix of the maximum likelihood predictor
in (7.15) can be written as

$$
\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)} \mid X_{(L)}; \hat{\theta})
= \hat{\Psi}_{(L)} + A_{(L)},
$$

where

$$
A_{(L)} = \left(\hat{\Lambda}_{(L)}' \hat{\Theta}_{(L)}^{-1} \hat{\Lambda}_{(L)}\right)^{-1},
$$

and $A_{(L)}$ and $\hat{\Psi}_{(L)}$ represent the intra-cluster and
inter-cluster contributions to the covariance matrix, respectively. The
multivariate reliability of the maximum likelihood estimator can then be
defined as

$$
R_{(L)} \equiv \mathrm{Cov}_{\zeta}(\zeta_{(L)})
\left[\mathrm{Cov}_y(\hat{\zeta}^{ML}_{(L)})\right]^{-1}
= \hat{\Psi}_{(L)} \left(\hat{\Psi}_{(L)} + A_{(L)}\right)^{-1}.
$$

Using the same line of proof as in Bock (1983), the following identity can
be demonstrated:

$$
\tilde{\zeta}^{EB}_{(L)} = \tilde{\zeta}^{EBM}_{(L)} = R_{(L)}\, \hat{\zeta}^{ML}_{(L)}.
$$

This identity is a multivariate representation of the phenomenon dubbed
shrinkage in the statistical literature (e.g. James and Stein, 1961). We note
that the empirical Bayes predictor is pulled toward the prior expectation $0$
of the latent variables whenever the estimated level-1 variation
$\hat{\Theta}_{(L)}$ is large relative to the estimated inter-cluster variation
$\hat{\Psi}_{(L)}$. On the other hand, the empirical Bayes predictor is pulled
toward the maximum likelihood estimator when the inter-cluster variation is
large compared to the intra-cluster variation (for instance due to large
cluster sizes). In the limit, where $R_{(L)} = I$, we obtain
$\tilde{\zeta}^{EB}_{(L)} = \tilde{\zeta}^{EBM}_{(L)} = \hat{\zeta}^{ML}_{(L)}$,
so that all three methodologies coincide. Of course, all these results are in
perfect accordance with our intuition regarding a sensible latent scoring
methodology.

For a random intercept model, we encountered this reliability in (7.5) with

$$
R_{j(2)} = \frac{\hat{\psi}}{\hat{\psi} + \hat{\theta}/n_j}.
$$

Note that $R_{1(3)}$ is not defined for the three-level numerical example since
the matrix to be inverted in $A_{1(3)}$ is singular for any higher-level model.
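The shrinkage identity can be checked numerically in a model where the maximum likelihood estimator exists. The sketch assumes a linear growth model for one cluster with a random intercept and slope, five occasions, and arbitrary covariance values; the residuals are random stand-ins for $y - X\hat{\beta}$. The symbol `A` denotes the intra-cluster contribution $(\Lambda'\Theta^{-1}\Lambda)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed growth model for one cluster: random intercept and random slope.
t = np.arange(5.0)
Lam = np.column_stack([np.ones(5), t])
Psi = np.array([[1.0, 0.2],
                [0.2, 0.5]])              # assumed prior covariance matrix
Theta = 0.8 * np.eye(5)                   # assumed level-1 covariance matrix
resid = rng.normal(size=5)                # stand-in for y - X beta

Sigma = Lam @ Psi @ Lam.T + Theta
Theta_inv = np.linalg.inv(Theta)

A = np.linalg.inv(Lam.T @ Theta_inv @ Lam)              # intra-cluster contribution
zeta_ml = A @ Lam.T @ Theta_inv @ resid                 # maximum likelihood, (7.14)
zeta_eb = Psi @ Lam.T @ np.linalg.solve(Sigma, resid)   # empirical Bayes, linear case
R = Psi @ np.linalg.inv(Psi + A)                        # multivariate reliability

print(zeta_eb)
print(R @ zeta_ml)
```

Increasing the number of occasions or lowering the level-1 variance moves `R` toward the identity matrix, so that the two predictors approach each other.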



7.7 Ad hoc scoring methods 

For hypothetical constructs common in the social and behavioral sciences (see
Section 1.3), scores are often assigned by ad hoc methods such as simply
summing the values of responses from a number of indicators or items. An
important assumption underlying these methods is that the items contributing
to the score measure a unidimensional construct. For multidimensional
constructs, the methods are sometimes applied to subsets of items, producing
‘subscales’.


7.7.1 Raw sumscores

The most common ad hoc approach for continuous, ordinal and dichotomous
responses is undoubtedly the use of raw or unweighted sumscores, defined as

$$
y_{\cdot j} = \sum_{i=1}^{n_j} y_{ij}.
$$

The resultant score is denoted a Likert scale (Likert, 1932) when the responses
are ordinal and given successive integer codes.

In psychology, psychiatry and related fields, measurement scales based on
questionnaires or structured interviews are usually defined as raw sumscores
of dichotomous or ordinal responses. Researchers using such ‘validated
instruments’ are expected to adhere to the scoring method described in the
manual accompanying the questionnaire. This practice effectively discourages
serious measurement modeling.

In some cases the sumscore method can be given a theoretical justification.
For continuous responses, the raw sumscore methodology is equivalent
to maximum likelihood estimation of the scores (i.e. Bartlett method) for
unidimensional parallel measurement models (see Section 3.3.2), where all
factor loadings are equal and all unique factor variances are equal (e.g.
Jöreskog, 1971b; Maxwell, 1971). For dichotomous responses the raw sumscore
forms a sufficient statistic for estimating the ‘ability’ in the Rasch model
discussed in Section 3.3.4. However, both the unidimensional parallel
measurement and Rasch models are very restrictive models and unlikely to hold
in practice.
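The sufficiency of the raw sumscore in the Rasch model can be illustrated numerically: two response patterns with the same sum yield the same maximum likelihood ability estimate. The sketch assumes the parameterization $p_i = [1 + \exp\{-(\zeta - b_i)\}]^{-1}$ with known item difficulties `b`; the difficulties and patterns are hypothetical.

```python
import numpy as np

def rasch_ml_ability(y, b, max_iter=100, tol=1e-12):
    """ML ability estimate for known difficulties b (needs 0 < sum(y) < len(y))."""
    zeta = 0.0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(zeta - b)))
        grad = np.sum(y - p)            # the responses enter only through sum(y)
        hess = -np.sum(p * (1.0 - p))
        step = grad / hess
        zeta -= step                    # Newton-Raphson update
        if abs(step) < tol:
            break
    return zeta

b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # assumed item difficulties

# Two different patterns with the same raw sumscore of 2.
est_easy_items = rasch_ml_ability(np.array([1, 1, 0, 0, 0]), b)
est_hard_items = rasch_ml_ability(np.array([0, 0, 0, 1, 1]), b)
print(est_easy_items, est_hard_items)
```

If the items had unequal discriminations, the likelihood equation would weight the responses differently and the two patterns would no longer give the same estimate.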

In general, use of raw sumscores as a measurement strategy cannot be given
a theoretical motivation and usually implies a rejection of measurement
modeling. Torgerson (1958) therefore describes the sumscore strategy as an
example of ‘measurement by fiat’, in contrast to the more respectable
‘fundamental measurement’ obtained from measurement modeling. In practice, some
form of modeling is often employed to justify the use of sumscores. For
example, unidimensionality is typically investigated through the use of factor
modeling. It appears somewhat inconsistent to use modeling arguments as a
justification for the use of an ad hoc scoring method.

A standard argument in favor of the raw sumscore methodology is that it 
has been demonstrated repeatedly that the Pearson correlation between the 
sumscore and scores from more sophisticated methodologies often approaches 
one, especially when the number of variables is relatively large. This holds for 
continuous items (see Wilks, 1938; Gulliksen, 1950; Wang and Stanley, 1970; 
Wainer, 1976) as well as for dichotomous items (see Muthén, 1977; Kim and 
Rabjohn, 1978). The bright side of these results is that an extremely simple 
approach appears to work as well as much more cumbersome methodologies. 
However, the use of the high Pearson correlations as evidence has been crit¬ 
icized. The point is that differences on specific parts of the latent scale of 
special interest, for instance a cut-off point for admission in an ability test, 
can be masked by the use of a summary statistic such as the correlation (e.g. 
Hambleton and Swaminathan, 1985). 

A major limitation of the sumscore approach is that it cannot be directly 
applied to clusters with missing items. An ad hoc strategy in this case is 
to impute the responses for the missing items using the cluster mean of the 
nonmissing units. Finally, the sumscore methodology cannot incorporate co¬ 
variate information or relationships (regressions or covariances) between latent 
variables in contrast to the model-based approach. 


7.7.2 Other methods 

One version of the raw sumscore strategy is to discard ‘nonsalient’ items from 
the sumscore (see e.g. Thurstone, 1947; Gorsuch, 1983). A basic problem with 
this variant is that the employed definitions of salience are arbitrary. 

The representative item strategy discards all but one particular item, pre¬ 
sumably easily measured and valid, and takes the response for that item as 
the latent score (e.g. Rummel, 1967). This approach is often used in quality 
of life questionnaires where one item asks directly about quality of life. Here 
the answer to that single question is often used as the gold standard with 
which scores derived from the remaining items are validated. Obviously, the 
representative item strategy amounts to wasting information and presupposes 
data of extremely high quality (e.g. Adams, 1975). 

In the factor loadings as weights strategy the scores are obtained by using 
a weighted sumscore using the factor loadings as weights (e.g. Fruchter, 1954; 
Blalock, 1960). One problem with this strategy is that it lacks a theoretical 
rationale. Another problem is that the items are individually credited with an 
influence that they share with other items, yielding redundant solutions (see 
also Glass and Maguire, 1966; Harris, 1967; Halperin, 1976). 

Finally, ‘linear case’ factor scoring methods using the factor scoring matrix 
are sometimes applied to noncontinuous responses, an ad hoc strategy that 
should be avoided. 


7.8 Some uses of latent scoring and classification 

7.8.1 Introduction 

There are many applications of latent scoring and classification including mea¬ 
surement, ability scoring, disease mapping, small area estimation, medical 
diagnosis, image analysis and model diagnostics. For continuous latent vari¬ 
ables we can distinguish between two kinds of latent scores. Factor scores for 
measurement are discussed in Section 7.8.2 and random effects scores in Sec¬ 
tion 7.8.3. In Section 7.8.4 we briefly discuss classification and in Section 7.8.5 
we point out that latent scoring is useful for model diagnostics, a topic we 
return to in more detail in Section 8.6.2 of the next chapter. 


7.8.2 Factor scoring as measurement proper 

A conventional definition of measurement is due to Stevens (1951), who de¬ 
fined measurement as merely the assignment of a number to an attribute 
according to a rule. This definition is extremely broad, and in our opinion not 
particularly fruitful. Many phenomena of interest are best construed as latent 
variables or factors and it is natural to focus on factor scoring. 

Clogg (1988) defines the measurement process culminating in measurement 
as embodying the following steps: 

1. Selection of items 

2. Specification of tentative latent trait models 

3. Choice of retained latent trait model 

4. Estimation and interpretation of model parameters 

5. Measurement of latent traits. 

Much has been written in the psychometric and statistical literature regard¬ 
ing the first four steps of the measurement process. Somewhat surprisingly, 
and not without irony, most treatments of measurement modeling stop short 
of measurement. Consider for instance the chapter denoted ‘Measurement’ 
by Bohrnstedt (1983). Although giving a nice introduction to measurement 
models, it fails to address precisely what is expected from the title, namely 
measurement per se. 

At the other extreme, the purpose of latent variable modeling sometimes 
is to derive scoring procedures (e.g. Gorsuch, 1983). The scoring procedures 
or keys are subsequently employed in other samples, making latent variable 
modeling superfluous in future research. Unfortunately, the entire strategy 
of generalizing scoring procedures across populations is highly problematic. 
In particular, results from the theory of factorial invariance (e.g. Skrondal 
and Laake, 1999; Meredith, 1964, 1993) suggest that scoring weights are not 
expected to be invariant across populations. Thus, we recommend that the 
model parameters are estimated or ‘calibrated’ on the same population for 
which latent scores are desired, possibly by using multi-group modeling. 

Apart from the need for measurement, there are several other motivations 
for obtaining factor scores. Factor scores can help in model interpretation. 
In particular, plots of latent scores often prove useful in discussing issues 
of dimensionality in factor models (e.g. McDonald, 1967; Etezadi-Amoli and 
McDonald, 1983), see Figures 10.5 and 10.7 for applications. Factor scores 
can also be useful in classification of units. Examples include the admission 
of students according to ability tests and treatment of patients according to 
mental health tests (e.g. Duncan-Jones et al., 1986; Muthén, 1989a). However, 
for this purpose it appears more natural to use a discrete latent variable and 
classify units using the posterior probabilities. 

Factor scores are also useful in tailored or adaptive testing. Here the scores 
are updated sequentially as new item responses are obtained. The ‘current’ 
score determines the choice of subsequent items in order to maximize the 
obtained ‘information’ (e.g. Bock and Mislevy, 1982). 

Conventionally, factor scores have served as vehicles for further analysis. 
Specifically, factor scores are often used to impute latent variables in structural 
equation models. This use has been hailed as one of the major motivations for 
obtaining factor scores by a number of authors (e.g. Kim and Mueller, 1978; 
Gorsuch, 1983; Johnson and Wichern, 1983). Kim and Mueller (1978, p.60) 
stated that 

“In fact, with the exception of the psychometric literature, factor analysis seems 
to have been used more often as a means of creating factor scales for other studies 
than as a means of studying factor structures per se.” 

This statement is also apt for the present situation as is apparent in some 
software packages where factor analysis is one of the options in the ‘data 
reduction’ menu, another option being principal component analysis. In view of 
the frequent use of factor scores as vehicles for further analysis, it is important 
to point out that this approach can be problematic. Unless care is exercised 
in obtaining the scores, estimates of model parameters will be biased; in 
addition, standard errors are underestimated (Skrondal and Laake, 2001). 

The modern approach to estimating latent variable models is to estimate 
the model parameters directly, without resorting to imputed latent variables. 
This fact may explain the remarkable paucity of research on scoring for latent 
variable models. 


7.8.3 Random effects scores as cluster-specific effects 

In random effects models, individual clusters are construed as having their 
own regression ‘parameters’, sampled from some distribution. Consequently, 
random effects scores represent an assessment of the cluster-specific effects of 
explanatory variables. Obviously, such effects are often of considerable sub¬ 
stantive interest. Random effects scores are extremely useful in growth curve 
or development modeling. In this case the scores form the basis for plotting 
growth trajectories for the individual clusters or groups of clusters (e.g. Strenio 
et al., 1983). An application for epileptic seizures is given in Figure 11.1. 

Another application of random effects scores is classification and ranking 
of clusters. Examples include the classification of organizations as more or 
less effective according to their random effects scores (Aitkin and Longford, 
1986), ranking of different industries in terms of the gender gap in earnings 
(Kreft and de Leeuw, 1994) and ranking of schools in terms of exam per¬ 
formance (Goldstein and Spiegelhalter, 1996). An appropriate standard error 
for comparing the random effects of two units is the standard deviation of 
the prediction error. Rankings can however be extremely variable and their 
precision is not easily expressed in terms of standard errors (Goldstein and 
Spiegelhalter, 1996). 

Random effects models are often used to combine effect size estimates from 
different studies in a meta-analysis. In this case, it is natural to replace the 
original effect size estimates with the posterior means of the random effects 
model to ‘borrow strength’ from other studies. Posterior standard deviations, 
which equal the prediction error standard deviations in the continuous case, 
are then typically used to construct confidence intervals. In Section 9.5 
we use this approach in a meta-analysis of nicotine gum for smoking cessation. 
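
To illustrate how posterior means borrow strength, the sketch below applies the normal-normal random effects model to made-up effect sizes. It is an illustration under stated assumptions rather than the analysis of Section 9.5: the data, the function name `empirical_bayes_effects` and the between-study variance `tau2` (treated as known) are all hypothetical, and the posterior standard deviations ignore the uncertainty in the estimated overall mean.

```python
import numpy as np

def empirical_bayes_effects(y, se, tau2):
    """Shrink study estimates y (standard errors se) toward the pooled mean.

    Assumes y_j ~ N(theta_j, se_j^2) and theta_j ~ N(mu, tau2), with tau2 known.
    """
    y = np.asarray(y, dtype=float)
    v = np.asarray(se, dtype=float) ** 2
    w = 1.0 / (v + tau2)                      # marginal precisions
    mu_hat = np.sum(w * y) / np.sum(w)        # precision-weighted pooled mean
    shrink = tau2 / (tau2 + v)                # weight given to the study's own estimate
    post_mean = mu_hat + shrink * (y - mu_hat)
    post_sd = np.sqrt(shrink * v)             # posterior SD with mu and tau2 taken as known
    return mu_hat, post_mean, post_sd

# Hypothetical effect sizes and standard errors for four studies.
y = np.array([0.10, 0.60, 0.35, 1.20])
se = np.array([0.10, 0.30, 0.20, 0.60])
tau2 = 0.04                                   # assumed between-study variance

mu_hat, post_mean, post_sd = empirical_bayes_effects(y, se, tau2)
shrinkage = 1.0 - tau2 / (tau2 + se**2)       # fraction of the distance to mu_hat removed
```

The least precise study (the last one) is pulled furthest toward the pooled mean, and every posterior standard deviation is smaller than the corresponding original standard error.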

Random effects scoring is popular for disease mapping and small area esti¬ 
mation. In regions with small populations, the raw incidence estimates can be 
very imprecise and the resulting map ‘noisy’. Borrowing strength from other 
regions can result in more reliable and smoother maps. Ideally, the models 
should in this case exploit spatial information; see Clayton and Kaldor (1987), 
Langford et al. (1999) and Section 11.4. 


7.8.4 Classification 

Sometimes a latent variable is inherently discrete, the canonical example be¬ 
ing medical diagnosis where a patient either has a particular illness or not. 
In this case classification is crucial for prescribing the correct treatment. In 
marketing, customers are sometimes classified as belonging to one of several 
‘market segments’, characterized by specific sets of preferences, for the purpose 
of targeted advertising (e.g. Wedel and Kamakura, 2000). 

Often the latent variable may be best perceived as continuous, but clas¬ 
sification is required to make a decision. In this case it might be preferable 
to specify a discrete latent variable model (such as a latent class model). We 
can then use empirical Bayes modal to classify the units instead of applying 
arbitrary thresholds to a continuous score. In education, this approach has 
been used when mastery of a subject is of interest rather than ability on a 
continuous scale (e.g. Bergan, 1988; MacReady and Dayton, 1992). 
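
As a small illustration of empirical Bayes modal classification, the sketch below computes posterior class probabilities for a two-class latent class model with dichotomous items and assigns each unit to its most probable class. The class sizes, the conditional response probabilities and the labels ‘non-master’ and ‘master’ are invented for the example, and the items are assumed conditionally independent given class; in practice these parameters would be replaced by estimates.

```python
import numpy as np

def posterior_class_probs(y, class_probs, item_probs):
    """Posterior class membership probabilities for binary responses.

    y           : (units, items) array of 0/1 responses
    class_probs : (classes,) prior class probabilities
    item_probs  : (classes, items) Pr(item = 1 | class)
    """
    y = np.asarray(y, dtype=float)
    item_probs = np.asarray(item_probs, dtype=float)
    # Log-likelihood of each response pattern within each class.
    log_lik = y @ np.log(item_probs).T + (1.0 - y) @ np.log(1.0 - item_probs).T
    log_post = np.log(class_probs) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # guard against underflow
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Hypothetical parameters: class 0 = 'non-master', class 1 = 'master'.
class_probs = np.array([0.7, 0.3])
item_probs = np.array([[0.20, 0.30, 0.10, 0.25],
                       [0.90, 0.80, 0.85, 0.70]])

responses = np.array([[0, 0, 0, 0],
                      [1, 1, 1, 1],
                      [1, 0, 1, 0]])

post = posterior_class_probs(responses, class_probs, item_probs)
modal_class = post.argmax(axis=1)      # empirical Bayes modal classification
```

For the mixed pattern in the third row the posterior probability of the second class is about 0.65, so the unit is classified as a ‘master’ even though that class has the smaller prior probability.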

A common problem in image analysis is image segmentation or restoration 
where pixels (picture elements on a square grid) or voxels (volume elements on 
a cubic grid) are classified as belonging to one of several regions. An example 
in brain imaging is delineating a brain tumor. Here, spatial models such as 
Markov random fields (e.g. Besag, 1986) are sometimes specified for the latent 
region labels or latent classes, and conditionally on the latent classes the 
responses are normally distributed representing ‘noise’. Segmentation is then 
achieved by finding the modal a posteriori region labels using for instance 
‘simulated annealing’ (e.g. Geman and Geman, 1984). 

In this book we mostly consider latent class models for classification when 
the true classification is not known for any of the units. If the true classification 
is known for a subsample of units, sometimes called the ‘training set’ 
or ‘validation sample’, latent class models can be extended as shown in Section 
14.3 in the context of covariate misclassification. Wedel (2002a) discusses 
the problem where ‘core variables’ (the responses) and ‘concomitant variables’ 
(explanatory variables) are observed for the training set, but only concomitant 
variables are observed for new units to be classified. A different problem 
is prediction of a single categorical response variable y, where the discrete 
latent variables serve as a ‘hidden layer’ of intervening variables as in neural 
networks (see Vermunt and Magidson (2003b) for an overview of such models). 

As in latent scoring, latent classifications are sometimes used as ‘observed’ 
variables in subsequent analyses. A related approach mentioned by Wedel 
(2002b) is to regress estimated posterior probabilities on explanatory vari¬ 
ables. Both these ad hoc approaches are problematic and should be avoided 
(e.g. Croon, 2002). As for continuous latent variables, it is preferable to model 
the relationship between latent classes and observed responses and/or covari¬ 
ates directly. In Section 14.3 we estimate models with latent classes as covari¬ 
ates and in Section 13.5 latent classes are regressed on covariates. Note that 
prediction of latent class membership in this case uses information from all 
variables included in the model and not only the items ‘measuring’ the latent 
classes. 

7.8.5 Model diagnostics 

An important application of latent scoring is model diagnostics. It might be 
tempting to use latent scores to study the assumptions underlying statistical 
models in much the same way as if observed variables were investigated. How¬ 
ever, it should be remembered that the distribution of the predicted scores 
is not the same as the theoretical distribution of the latent variables, mak¬ 
ing this approach problematic, particularly for nonnormal responses. See also 
Section 8.6.2. 

Latent scores can be treated as estimated residuals for outlier detection, for 
instance by comparing the scores with their approximate sampling standard 
deviation. The use of latent scoring in diagnostics for general latent variable 
models will be discussed in Section 8.6.2; see Section 11.3.3 for a concrete 
application. 


7.9 Summary and further reading 

For all the model-based scoring and classification methods discussed in this 
chapter, missing responses do not pose any problems as long as they can be 
assumed to be missing at random (MAR). 

By far the most common approach for assigning values to continuous la¬ 
tent variables is empirical Bayes prediction. The advantage of this approach 
is that the predictions minimize a mean squared error loss function (if the 
model parameters are assumed known) and are unconditionally unbiased. The 
conditional bias or shrinkage associated with the method can be seen as an 
advantage when sparse information is available on some units. 

The maximum likelihood method is sometimes used since, in contrast to 
empirical Bayes, the scores are conditionally unbiased. However, this approach 
is not consistent with the modeling assumptions and will not yield predictions 
for clusters with insufficient information. Furthermore, the method cannot be 
applied to truly multilevel models. 

There are various ways of defining standard errors. The most commonly 
used are the posterior standard deviation, equal to the standard deviation 
of the prediction error in the continuous case, and the marginal sampling 
standard deviation. Whichever scoring method is used, it is important to use 
the standard error appropriate for the particular application. 

For discrete latent variables, the most common classification method is empirical 
Bayes modal since it minimizes the expected misclassification rate. 

In contrast to estimation of latent variable models, the literature on latent 
scoring and classification is relatively scant. However, useful books on empiri¬ 
cal Bayes are Maritz and Lwin (1989) and Carlin and Louis (1998). Empirical 
Bayes prediction in linear random coefficient models is reviewed by Strenio et 
al. (1983). A nice overview of the Bartlett and regression methods for factor 
analysis is given in Lawley and Maxwell (1971, Chapter 8). 


Appendix: Some software 

We now provide an incomplete list of software implementing the different 
methods of assigning values to latent variables. We do not provide addresses 
or links to the software since such information is quickly out of date and can 
easily be found on the internet. 

• Empirical Bayes 

- aML uses quadrature for multilevel and multiprocess models (Lillard 
and Panis, 2000) 

- BILOG-MG uses quadrature for binary logistic item-response models 
(Zimowski et al., 1996; Du Toit, 2003) 

- gllapred, the prediction command for gllamm, uses adaptive quadrature 
for generalized linear latent and mixed models (Rabe-Hesketh et 
al., 2001b, 2004c) 

- … for two-level generalized linear mixed models (Hedeker and Gibbons, 
1996ab; Hedeker, 1999) 

- TESTFACT uses adaptive quadrature for multidimensional probit factor 
models (Bock et al., 1999) 

• Empirical Bayes modal 

- Scoring: 

* BILOG-MG for binary logistic item-response models (Zimowski et 
al., 1996; Du Toit, 2003) 

* HLM uses PQL or LaPlace6 for multilevel generalized linear mixed 
models (Raudenbush et al., 2001) 

* MLwiN uses PQL for multilevel generalized linear mixed models 
(Rasbash et al., 2000) 

* Mplus for structural equation models with continuous, dichotomous, 
ordinal and censored responses (Muthén and Muthén, 1998, 2003) 

* SAS NLMIXED for two-level generalized linear mixed models 
(Wolfinger, 1999) 

- Classification: 

* Mplus for latent class models with continuous, dichotomous, ordinal 
and censored responses (Muthén and Muthén, 1998, 2003) 

* Latent GOLD for most response types (Vermunt and Magidson, 2000, 
2003a) 

* gllamm (posterior probabilities) for generalized linear latent and mixed 
models in Stata (Rabe-Hesketh et al., 2001b, 2004c) 

• Maximum likelihood 

- BILOG-MG for binary logistic item-response models (e.g. Du Toit, 2003) 


CHAPTER 8 


Model specification and inference 


8.1 Introduction 

In Chapter 6 we discussed the problem of estimating the parameters of a 
given statistical model without considering why that particular model was 
specified. In this chapter we consider the perhaps more difficult task of finding 
an appropriate model. 

Before considering how to proceed with model specification we reflect upon 
statistical modeling in Section 8.2. Specifically, we discuss the roles and pur¬ 
poses of different kinds of statistical models as well as modeling strategies. 
In Section 8.3 we review maximum likelihood inference which is useful for 
assessing the uncertainty of estimates and forms the basis of many ‘relative 
fit’ criteria for model selection discussed in Section 8.4. It could be argued 
that when relying solely on relative fit, all we can say is that a model appears to be 
better than the competitors, but little is known concerning how good or bad 
the better model is in an absolute sense. Furthermore, misspecification of the 
candidate models could invalidate model selection. In Section 8.5 we therefore 
discuss methods of ascertaining how good the best model is using ‘global 
absolute fit’ criteria. In contrast to global absolute fit criteria, ‘local absolute 
fit’ criteria can be used not only to discover that a model is inadequate but 
also to diagnose where a model is misspecified. Such diagnostics are discussed 
in Section 8.6. 

The organization of this chapter may seem to imply that model building 
proceeds in the following sequence: (1) select the ‘best’ model among a set 
of models, (2) assess the adequacy of the selected model (if feasible), (3) use 
diagnostics to pinpoint misspecification, which may suggest different ways of 
elaborating the model taking us back to (1). However, this sequence has no 
particular theoretical justification and other sequences may be just as useful. 


8.2 Statistical modeling 

8.2.1 Types of statistical model 

Broadly speaking, in applications of mathematical models in empirical research 
the relationship between variables is formalized in terms of one or more de¬ 
terministic equations. Such models are appropriate in situations where there 
are known deterministic laws governing the relationships, as is sometimes the 
case in the natural sciences, a typical example being Newton’s laws of motion. 
In contrast, statistical models are mathematical models which also include a 
random or stochastic component in addition to the deterministic component. 
The random component may represent measurement error, making statistical 
models useful also in studying deterministic laws. More importantly, it could 
represent ‘natural variation’ or stochastic causal laws, for instance those of 
Mendelian inheritance. Finally, the random component may reflect our incomplete 
knowledge regarding a deterministic ‘law’ governing the empirical 
phenomena under consideration. Note that some statistical tools do not in¬ 
volve statistical models since they have no random component. An example 
is principal component analysis which is merely an orthogonal transformation 
of the data. 

Two types of statistical models have typically been delineated in the lit¬ 
erature. We adopt the terms substantive and empirical models used by Cox 
(1990). Lehmann (1990) describes similar distinctions put forth by Neyman 
(e.g. Neyman, 1939) who contrasted ‘explanatory models’ versus ‘interpolatory 
formulae’ and Box and colleagues (e.g. Box et al., 1978) who contrast 
‘theoretical’ or ‘mechanistic’ models versus ‘empirical models’. 


Substantive models 

The most appealing statistical models are substantive models which connect 
directly with subject matter considerations and background information, constituting 
an effort to achieve understanding and explanations, i.e., answers to 
‘why questions’ in the terminology of philosophers. Typically the researcher 
believes that there is a single ‘true’ model that has generated the data. 

‘Directly substantive models’ explain what is observed in terms of explicit 
mechanisms, usually via quantities that are not directly observed and some 
theoretical notions as to how the system under study ‘works’. In the natu¬ 
ral sciences there may be one or at most a few stringent competing theories 
purporting to explain the observations in terms of lawlike relationships with 
a specified functional form. Neyman’s favorite example was Mendelian inheritance. 
To ‘test’ a theory the researcher can sometimes vary a ‘treatment’ of 
interest under controlled conditions using randomization. 

A weaker type of substantive model merely posits substantive hypotheses 
about dependencies, for instance in graphical or structural equation model¬ 
ing where some variables may be specified as conditionally independent given 
other variables. An example is the structural equation model positing ‘complete 
mediation’ mentioned in Section 1.3, where a variable only has an indirect 
effect on the outcome via an intermediate variable and no direct effect. 
Such models are useful in the social sciences and medicine where there are 
often several rather loose and sometimes conflicting ideas about how more or 
less fuzzy phenomena are associated. Here, information often stems from an 
existing body of empirical research on the problem under investigation and re¬ 
lated problems. However, studies are typically based on observational designs 
where the researcher is merely a passive observer of the empirical process, 
making it daunting to investigate causality. A somewhat stronger design is 
the quasi-experiment (e.g. Cook and Campbell, 1979) where the researcher 
can vary the treatment but randomization is for some reason unfeasible. 

Since a substantive statistical model is a simplified representation of the 
data generating process, it should be possible to simulate data directly from 
the model. A good substantive model should be parsimonious and yet capture 
the main features of the data generating process; it should neither be too 
complex nor too simplistic. An overly complex model is of little use since it 
merely mirrors the realized data instead of the underlying process and is likely 
to be a poor representation of other realizations of the process. On the other 
hand, an overly simplistic model may fail to capture important aspects of the 
data generating process and lead to incorrect inferences. 

Cox and Wermuth (1996) suggest that a satisfactory substantive statistical 
model should: 

1. establish a link with background knowledge 

2. set up a connection with previous work 

3. give some pointer toward a data generating process 

4. have primary parameters with clear subject-specific substantive interpre¬ 
tations 

5. specify haphazard aspects well enough to provide meaningful assessment of 
precision 

6. have adequate fit 


Empirical models 

Empirical models are the more common type of model in many applications 
where background information is relatively scarce. According to Box, empirical 
models are used as a guide to action, with emphasis on prediction. The models 
are intended to provide guidance for the particular situation at hand, using 
all special circumstances, which means that good approximations can only 
be expected over the area of interest. Empirical models may be obtained 
from a family of models selected largely for convenience, on the basis solely 
of the data without much input from the underlying situation. Instead of 
believing that there is one true model, the researcher is looking for one among 
several potentially useful models. A modern version is the algorithmic models 
advocated by Breiman (e.g. Breiman, 2001), a black-box strategy focusing on 
effective and flexible algorithms for prediction of output from input. No effort 
whatsoever is made to explicate the black box whose contents are treated as 
unknown. Typical examples would be neural networks or ‘regression forests’. 

Cox (1990) considers a less extreme kind of empirical model which is not 
based on specific substantive considerations but rather aims to represent 
in idealized form dependencies, often ‘smooth’ dependencies, thought to be 
present. According to Cox, the first and most common role for empirical mod¬ 
els is to estimate effects and their precision. The widespread use of regression 
models is a canonical example, for instance the estimation of associations from 
logistic regression in epidemiology. It is important to note that these kinds of 
empirical models are not void of substantive considerations, for instance in 
the choice of confounders in epidemiology. Another role of empirical models 
is ‘correction of deficiencies in data’ such as measurement error, missing data 
and complex sampling. This less extreme notion of empirical model is similar 
to the idea of a weak substantive model. 


8.2.2 Modeling strategies 

The problem of specification, the task of specifying a low-dimensional para¬ 
metric statistical model, was the first kind of problem of statistics mentioned 
by Fisher (1922). Interestingly, his discussion of specification was confined to 
a single paragraph dominated by the first sentence: 

“As regards problems of specification, these are entirely a matter for the prac¬ 
tical statistician,...” 

Lehmann (1990) interprets Fisher’s statement to imply that there can be no 
theory of modeling and no modeling strategies, but that instead each problem 
must be considered entirely on its own merits. 

However, in practice one of four kinds of modeling strategies is typically 
adopted (see Jöreskog (1993) for a similar classification). For substantive models 
a natural modeling strategy is the strictly confirmatory approach involving 
one or perhaps two models. For instance, in measurement modeling one might 
wish to determine whether a particular independent clusters structure (see 
Section 3.3.3) holds by either retaining or rejecting this model based on abso¬ 
lute fit criteria; see Section 8.5. Clinical trials often involve two models where 
the null model is typically that there is no effect of a drug and the alternative 
model is that there is an effect. Model selection then proceeds by hypothesis 
testing. It is important that the models are specified in advance (giving rise to 
the term ‘planned comparisons’ in ANOVA) and not suggested by the same 
data on which they are tested. Unfortunately, models or ‘theories’ suggested 
by the data are often presented as if they had been formulated in advance to 
make conclusions more credible. To prevent such malpractice in drug devel¬ 
opment (where it could have lethal consequences), it is becoming common to 
prepare a detailed analysis plan before a clinical trial is conducted. The anal¬ 
ysis plan specifies the ‘primary hypotheses’ and the exact manner in which 
they are to be tested. 

Another modeling strategy for substantive models is the competing models 
approach where a moderate number of alternative models are specified from 
which one is selected. This is appropriate if there are a few competing theories 
purporting to explain a phenomenon and one desires to dispel faulty ones. 

For empirical or weak substantive models a natural modeling strategy is 
the model generating approach where an initial tentative model is specified 
based on the available background information. If this model is deemed to be 
unacceptable according to diagnostic and/or fit criteria, it is modified either 
according to background theory or to achieve a better fit to the data. This 
iterative process of specification, estimation, confrontation with data and 
respecification proceeds until the model is found acceptable. 

For empirical models the typical modeling approach is strictly exploratory 
where models are ‘derived’ from the data. In practice, this corresponds to 
starting with a very large number of competing models and using purely sta¬ 
tistical criteria for choosing amongst them. A common example is best-subset 
linear regression where any subset of a large number of covariates is consid¬ 
ered. This approach may yield useful models for prediction if combined with 
some form of cross-validation (see page 271 for further discussion) to avoid 
overfitting. 

Importantly, the exploratory approach should not be used to derive substantive 
models. It is well known that results from such ‘data-dredging’ are 
at best suggestive and should be assessed on independent data using a confir¬ 
matory approach. Freedman (1983) shows that theories can easily be derived 
from pure noise. He simulated 51 independent standard normal variables for 
100 units, treating the last as response variable and the remaining as potential 
covariates in a linear regression. Selecting only covariates with coefficients significant 
at the 25% level and refitting produced ‘convincing’ results with many 
coefficients significant at the 5% level. A similar criticism in the context of ‘automatic 
interaction detection’, regression analysis, factor analysis and nonmetric multidimensional 
scaling was presented by Einhorn (1969). It is well worth citing 
from the conclusion in his paper ‘Alchemy in the Behavioral Sciences’: 

“It should be clear that proceeding without a theory and with powerful data 
analytic techniques can lead to large numbers of Type I errors. Just as the 
ancient alchemists were not successful in turning base metal into gold, the 
modern researcher cannot rely on the “computer” to turn his data into meaningful 
and valuable scientific information.” 
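
Freedman's screening experiment is easy to imitate. The following is a minimal sketch, not his original code: the function names, the seed and the use of normal critical values in place of exact t critical values are assumptions made for illustration. It regresses pure noise on 50 pure-noise covariates, keeps the covariates that pass a 25% screen, refits, and counts how many coefficients then appear significant at the 5% level.

```python
import numpy as np
from statistics import NormalDist


def t_statistics(X, y):
    """OLS with an intercept; return the t-statistics of the slope coefficients."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])            # add intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - Z.shape[1])       # residual variance
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return beta[1:] / np.sqrt(np.diag(cov)[1:])


def screening_experiment(seed=0, n_units=100, n_vars=51):
    """Return (covariates surviving the 25% screen, coefficients 'significant' at 5%)."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n_units, n_vars))   # pure noise, no true relationships
    X, y = data[:, :-1], data[:, -1]                # last column plays the response

    # Assumption: normal critical values approximate the t critical values
    z25 = NormalDist().inv_cdf(1 - 0.25 / 2)
    z05 = NormalDist().inv_cdf(1 - 0.05 / 2)

    keep = np.abs(t_statistics(X, y)) > z25         # pass 1: screen at the 25% level
    n_screened = int(keep.sum())
    if n_screened == 0:
        return 0, 0
    t2 = t_statistics(X[:, keep], y)                # pass 2: refit on the survivors
    n_sig5 = int((np.abs(t2) > z05).sum())
    return n_screened, n_sig5
```

Running `screening_experiment` over a range of seeds gives a feel for how often the refitted model looks ‘convincing’ although every covariate is noise by construction.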

Although the distinction between confirmatory and exploratory approaches 
and its implications apply to any kind of modeling, we now discuss it in the 
context of factor analysis. 

Example: Confirmatory versus exploratory factor analysis 

Confirmatory factor analysis (CFA) is a hypotheticist procedure designed 
to test hypotheses about the relationships between items and factors, where 
the number and interpretation of the factors are given in advance. Hence, in 
the confirmatory mode, particular parameters are set to prescribed values. 
Exploratory factor analysis (EFA), on the other hand, can be construed as 
an inductivist method designed to discover an optimal set of factors that 
accounts for the covariation among the items (see Mulaik, 1988b; Holzinger 
and Harman, 1941). Mulaik (1988a,b) gave three reasons why EFA cannot 
deliver what is promised in this inductivist programme: First, there are 
no rationally optimal ways to extract knowledge from experience without 
making prior assumptions. Second, the interpretation of an EFA is not 
unique, due to the factor indeterminacy problem (Guttman, 1955). Third, 
it is difficult to justify the results of a model which, in principle, can never 
be falsified. An ambiguity in EFA was pointed out by McDonald (1985, p.102): 

“In the exploratory approach, it might be claimed, we do not behave 
consistently. We first fit the model with many parameters and no constraint due 
to simple structure. We then transform the result to an equally fitting 
approximation to simple structure that may be very poor and speak as though we 
now have fewer parameters. But either the low numbers in the simple 
structure are consistent with exact zeros in the population or they are not. If 
they are, we should estimate only the nonzeros. If they are not, we do not in 
fact have simple structure at all.” 

It has been argued that the results from EFA have heuristic and suggestive 
value (e.g. Anderson, 1963) and may uncover hypotheses which are capable 
of more objective testing by other methods of multivariate analysis 
(Hotelling, 1957) and in new datasets. However, the prospects of obtaining 
sensible hypotheses from EFA are bleak, as was forcefully demonstrated 
by Armstrong (1967) (see also Mukherjee, 1973). Armstrong argues that 
meaningful EFA is only possible when considerable prior information is 
available, in which case CFA should be used in the first place (see also 
Section 1.3). 

The use of confirmatory models in scale development can also be criticized. 
Here, researchers sometimes have ‘pet’ models such as the unidimensional 
Rasch model discussed in Section 3.3.4 and discard items contradicting the 
model to ensure a good fit. Goldstein (1994) points out that this approach 
is dubious because the good fit, which has been engineered by discarding 
misfitting items, is then seen as supporting unidimensionality of the 
latent variable. 

From now on we assume that the aim is to select a weak substantive model 
using modeling strategies ranging from the strictly confirmatory to model 
generating. It is important to note that a number of crucial decisions are 
usually made on a more or less heuristic basis before formal statistical 
modeling is undertaken, including: 

• Selection of a model class (e.g. multilevel models) 

• ‘Causal’ ordering of the variables: 

  - Regression models: Typically classification into a set of explanatory 
    variables and another set of response variables (sometimes just one). 

  - Structural equation models: More elaborate ordering with explanatory 
    variables, intermediate variables and response variables. 

  - Latent variable models: Selection of variables ‘measuring’ the latent 
    variables. 

• Specification of probability distributions 

Given these decisions, the role of statistical modeling is usually to help make 
decisions regarding model form; for instance which explanatory variables to 
include, which interactions to include, which ‘paths’ to include in structural 
equation models and which items measure which latent variables. 

The adopted modeling strategy typically depends on the model class as well 
as the subject matter area. For instance, for linear mixed modeling in a 
biostatistical context, Verbeke and Molenberghs (2000, Chapter 9) suggest the 
following sequence for model building. First, find a preliminary model for the 
fixed part, selecting covariates and determining their functional relationships 
with the response. Second, find a preliminary model for the random part, 
deciding which effects to treat as random. Third, find a reasonable model for 
the level-1 error, for instance whether heteroscedasticity and/or 
autocorrelation should be specified or not. 


8.3 Inference (likelihood based) 

8.3.1 Properties of the fundamental and structural parameter estimates 

Having estimated the parameters by maximum likelihood, the next question 
concerns the properties of the parameter estimates. 

Consider first the estimated fundamental parameters $\hat{\vartheta}$. Since 
$\hat{\vartheta}$ is an ML estimator, it follows from e.g. Cox and Hinkley 
(1974) that it has a number of nice theoretical properties under suitable 
regularity conditions. Specifically, $\hat{\vartheta}$ is consistent, 
asymptotically normal, and asymptotically efficient. Consider then the 
estimators $\hat{\theta}$ of the structural parameters 
$\theta = h(\vartheta)$. First, since it is well known that ML estimators are 
invariant under transformations (e.g. Cox and Hinkley, 1974, p.287), the ML 
estimator of $\theta$, $\hat{\theta}$, is given by 

    $\hat{\theta} = h(\hat{\vartheta})$.    (8.1) 

It also follows that $\hat{\theta}$ inherits the asymptotic optimality 
properties of $\hat{\vartheta}$. 

Rubin (1976) shows that consistency is retained for maximum likelihood 
estimators if responses are missing at random (MAR). This requires that the 
probability that a response is missing does not depend on the value of the 
response had it been observed, although it may depend on covariates included 
in the model and other responses. Importantly, responses are not required 
to be missing completely at random (MCAR) where missingness does not 
depend on either covariates, observed responses or missing responses. Little 
(1995) points out that methods such as GEE which are often said to require 
MCAR really remain valid when missingness is covariate dependent. 

In the context of random effects models for longitudinal data, Little (1995) 
distinguishes between covariate dependent dropout, missing at random dropout, 
nonignorable outcome-based dropout and nonignorable random-coefficient-based 
dropout. Let $y_j$ denote a vector of both observed and unobserved (missing) 
responses, $y_j = [y_{obs,j}, y_{mis,j}]$, and let $r_j$ be a vector of 
missingness indicators for a unit $j$. With $X_j$ denoting the covariates and 
$\zeta_j$ the latent variables, the different types of dropout can be defined 
as: 

• Covariate dependent dropout 

    $\Pr(r_j \mid y_j, X_j, \zeta_j) = \Pr(r_j \mid X_j)$, 

• Missing at random dropout 

    $\Pr(r_j \mid y_j, X_j, \zeta_j) = \Pr(r_j \mid y_{obs,j}, X_j)$, 

• Nonignorable outcome-based dropout 

    $\Pr(r_j \mid y_j, X_j, \zeta_j) = \Pr(r_j \mid y_{obs,j}, y_{mis,j}, X_j)$, 

• Nonignorable random-coefficient-based dropout 

    $\Pr(r_j \mid y_j, X_j, \zeta_j) = \Pr(r_j \mid y_{obs,j}, X_j, \zeta_j)$. 

We believe that this classification is also useful for general latent variable 
models with nonmonotone missing data patterns (intermittent missingness in 
the longitudinal setting). 
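
The practical difference between these mechanisms can be seen in a small simulation. The sketch below is illustrative only: the bivariate normal set-up, the dropout probabilities and the function name are assumptions, not taken from Little (1995). A first response `y1` is always observed and a second response `y2` may be missing. For each mechanism the complete-case mean of `y2` is compared with a regression-based estimate, which corresponds to the ML estimate for bivariate normal data with this monotone pattern when missingness is ignorable.

```python
import numpy as np


def dropout_demo(n=20000, seed=0):
    """Return {mechanism: (complete-case mean, ML-type mean)} for y2; true mean is 1.0."""
    rng = np.random.default_rng(seed)
    y1 = rng.standard_normal(n)                          # always observed
    y2 = 1.0 + 0.8 * y1 + 0.6 * rng.standard_normal(n)   # may be missing

    def expit(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Probability that y2 is OBSERVED under three illustrative mechanisms
    p_obs = {
        "MCAR": np.full(n, 0.5),                     # depends on nothing
        "MAR": expit(-1.5 * y1),                     # depends on the observed y1 only
        "nonignorable": expit(-1.5 * (y2 - 1.0)),    # depends on y2 itself
    }
    out = {}
    for name, p in p_obs.items():
        obs = rng.random(n) < p
        cc_mean = y2[obs].mean()                     # complete-case estimate
        slope, intercept = np.polyfit(y1[obs], y2[obs], 1)
        ml_mean = intercept + slope * y1.mean()      # uses y1 from all units
        out[name] = (cc_mean, ml_mean)
    return out


results = dropout_demo()
```

Under MAR the complete-case mean is clearly biased while the likelihood-type estimate recovers the true mean; under the outcome-based mechanism both estimates are off, which is why the missingness process must then be modeled jointly with the substantive process.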

If missingness is either covariate dependent or at random (and the 
parameters of the substantive process and the missingness process are distinct) 
inference can be based solely on the likelihood for the observed responses (the 
substantive process). This is because the joint likelihood of the substantive 
and missingness processes decomposes into separate components. 
Unfortunately, this useful result does not hold for the two nonignorable 
missingness processes where both substantive and missingness processes must be 
modeled jointly (e.g. Heckman, 1979; Hausman and Wise, 1979; Wu and Carroll, 
1988; Diggle and Kenward, 1994). We refer to Little and Rubin (2002) for an 
extensive discussion. 


8.3.2 Model-based standard errors 

The asymptotic covariance matrix of $\hat{\vartheta}$ is 

    $\mathrm{Cov}(\hat{\vartheta}) = -\mathrm{E}(H(\vartheta))^{-1} \approx -H(\hat{\vartheta})^{-1}$,    (8.2) 

where $-\mathrm{E}(H(\vartheta))$ is the Fisher information or expected 
information and $-H(\hat{\vartheta})$, minus the Hessian of the log-likelihood, 
is the observed information. The observed information approximates the expected 
information due to the strong law of large numbers. 

There are three motivations for using the observed information in place of 
the expected information. The first reason is purely practical. The observed 
information is a by-product of the Newton-Raphson algorithm and therefore a 
natural choice when this algorithm is used for parameter estimation. Second, 
as was pointed out by Laird (1988) and Schluchter (1988) among others, use 
of the expected information is problematic in the context of missing data 
satisfying MAR but not MCAR. This is because one would have to integrate 
over the missing data mechanism to obtain the correct expected information. 
Third, Efron and Hinkley (1978, p.459) argue that the observed information 
is ‘closer to the data’ than the expected information (see also Kendall and 
Stuart, 1979), and that it tends to agree more closely with Bayesian analyses. 

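We can derive the covariance matrix of the estimated structural parameters 
from that of the estimated fundamental parameters using the multivariate 
delta-method (e.g. Serfling, 1980) applied to the function in (8.1): 

    $\mathrm{Cov}(\hat{\theta}) = \left(\frac{\partial h}{\partial \vartheta}\right) \mathrm{Cov}(\hat{\vartheta}) \left(\frac{\partial h}{\partial \vartheta}\right)'$, 

with the derivatives evaluated at $\hat{\vartheta}$. 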
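
The delta-method calculation can be sketched numerically as follows. The transformation `h`, the parameter values and the covariance matrix are hypothetical, chosen only to illustrate the mechanics; the Jacobian is approximated by central differences rather than derived analytically.

```python
import numpy as np


def delta_method_cov(h, vartheta_hat, cov_vartheta, eps=1e-6):
    """Covariance of h(vartheta_hat) using a central-difference Jacobian."""
    vartheta_hat = np.asarray(vartheta_hat, dtype=float)
    h0 = np.atleast_1d(h(vartheta_hat))
    jac = np.empty((h0.size, vartheta_hat.size))
    for k in range(vartheta_hat.size):
        step = np.zeros_like(vartheta_hat)
        step[k] = eps
        jac[:, k] = (np.atleast_1d(h(vartheta_hat + step)) -
                     np.atleast_1d(h(vartheta_hat - step))) / (2 * eps)
    return jac @ np.asarray(cov_vartheta) @ jac.T


# Hypothetical example: the fundamental parameters are a log standard deviation
# and a transformed correlation; the structural parameters are the variance
# and the correlation themselves.
def h(v):
    return np.array([np.exp(2 * v[0]), np.tanh(v[1])])


vartheta_hat = np.array([0.2, 0.5])
cov_vartheta = np.array([[0.010, 0.002],
                         [0.002, 0.020]])
cov_theta = delta_method_cov(h, vartheta_hat, cov_vartheta)
se_theta = np.sqrt(np.diag(cov_theta))
```

For the first structural parameter the result can be checked by hand, since the derivative of exp(2v) is 2 exp(2v), giving a variance of 4 exp(0.8) × 0.010.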


8.3.3 Robust standard errors 


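The total log-likelihood is the sum of the top-level cluster contributions 

    $\ell(\vartheta) = \sum_{z=1}^{n^{(L)}} \ell_z(\vartheta)$, 

where 

    $\ell_z(\vartheta) = \ln f^{(L)}(y_{z(L)} \mid X_{z(L)}; \vartheta)$. 

Therefore maximum likelihood estimators satisfy the likelihood equations 

    $g(\hat{\vartheta}) = \sum_{z=1}^{n^{(L)}} g_z(\hat{\vartheta}) = 0$,    (8.3) 

where $g_z$ is the score vector for the $z$th level-$L$ unit, 

    $g_z(\vartheta) = \frac{\partial \ell_z(\vartheta)}{\partial \vartheta}$. 

Using the delta-method, we can write the covariance matrix of 
$g(\hat{\vartheta})$ as 

    $\mathrm{Cov}[g(\hat{\vartheta})] = \left(\frac{\partial g}{\partial \vartheta}\right) \mathrm{Cov}(\hat{\vartheta}) \left(\frac{\partial g}{\partial \vartheta}\right)'$. 

Solving for $\mathrm{Cov}(\hat{\vartheta})$ gives 

    $\mathrm{Cov}(\hat{\vartheta}) = \left\{\frac{\partial g}{\partial \vartheta}\right\}^{-1} \mathrm{Cov}[g(\hat{\vartheta})] \left\{\left(\frac{\partial g}{\partial \vartheta}\right)'\right\}^{-1} = H^{-1}\, \mathrm{Cov}[g(\hat{\vartheta})]\, H^{-1}$,    (8.4) 

where $H \equiv H(\hat{\vartheta})$ is the Hessian of the log-likelihood at the 
parameter estimates. If the model is correct, 
$\mathrm{Cov}[g(\hat{\vartheta})] = -\mathrm{E}(H) \approx -H$ and therefore 
$\mathrm{Cov}(\hat{\vartheta}) = -\mathrm{E}(H)^{-1} \approx -H^{-1}$ as in 
equation (8.2). 

Instead of relying on the model being correctly specified, we can utilize that 
$g(\hat{\vartheta})$ in (8.3) is a sum of independent score vectors $g_z$ with 
mean 0, so that the empirical covariance matrix becomes 

    $\mathrm{Cov}[g(\hat{\vartheta})] = \frac{n^{(L)}}{n^{(L)} - 1} \sum_{z=1}^{n^{(L)}} g_z g_z'$.    (8.5) 

Substituting (8.5) into (8.4) gives the so-called sandwich estimator; see for 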
instance Huber (1967) and White (1982). This can be seen as an estimator 
of the covariance matrix of the design-based sampling distribution for the 
estimates defined as (implicit) functions of the data values (Binder, 1983). 
The sandwich estimator is popular in generalized estimating equations (see 
Section 6.9) and for complex survey data with sample weights, where the log- 
likelihood contributions are weighted by the inverse selection probabilities, 
giving a ‘pseudo-likelihood’. 

If the highest level units of the multilevel model are not mutually 
independent but clustered in $n_c$ mutually exclusive clusters with index sets 
$C_m$, $m = 1, \ldots, n_c$, then 

    $\mathrm{Cov}[g(\hat{\vartheta})] = \frac{n_c}{n_c - 1} \sum_{m=1}^{n_c} \left(\sum_{z \in C_m} g_z\right) \left(\sum_{z \in C_m} g_z\right)'$;    (8.6) 

see Wooldridge (2002, Section 13.8.2) and Williams (2000) for proofs. Muthén 
and Satorra (1995) suggest using this approach for linear structural equation 
models for clustered data with inverse probability weighting. 
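
As an illustration of the sandwich form $H^{-1}\,\mathrm{Cov}[g]\,H^{-1}$ with scores summed within clusters, here is a minimal sketch for a deliberately misspecified model: an ordinary Poisson regression fitted to simulated data that contain an ignored cluster effect. The data-generating values and function names are assumptions for illustration; the finite-sample factor n/(n − 1) is applied to the number of clusters.

```python
import numpy as np


def poisson_fit(X, y, n_iter=25):
    """Newton-Raphson for a log-linear Poisson model; returns (beta, Hessian)."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                      # start at the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        hess = -(X * mu[:, None]).T @ X             # Hessian of the log-likelihood
        beta = beta - np.linalg.solve(hess, score)
    mu = np.exp(X @ beta)
    return beta, -(X * mu[:, None]).T @ X


def sandwich_cov(X, y, beta, hess, cluster):
    """H^{-1} Cov[g] H^{-1} with score vectors summed within clusters."""
    unit_scores = X * (y - np.exp(X @ beta))[:, None]
    labels, idx = np.unique(cluster, return_inverse=True)
    g = np.zeros((labels.size, X.shape[1]))
    np.add.at(g, idx, unit_scores)                  # one score vector per cluster
    n = labels.size
    cov_g = n / (n - 1) * g.T @ g                   # empirical covariance of the total score
    h_inv = np.linalg.inv(hess)
    return h_inv @ cov_g @ h_inv


rng = np.random.default_rng(0)
n_clusters, size = 200, 5
cluster = np.repeat(np.arange(n_clusters), size)
u = rng.standard_normal(n_clusters)[cluster]        # cluster effect ignored by the model
x = rng.standard_normal(n_clusters * size)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x + u))

beta, hess = poisson_fit(X, y)
se_model = np.sqrt(np.diag(np.linalg.inv(-hess)))   # model-based, from the observed information
cov_robust = sandwich_cov(X, y, beta, hess, cluster)
se_robust = np.sqrt(np.diag(cov_robust))
```

Because the fitted model ignores the within-cluster dependence, the model-based standard error of the intercept is far too small here, whereas the sandwich version reflects the between-cluster variability of the scores.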

Obvious alternatives to the sandwich estimator are resampling methods 
such as the bootstrap and the jackknife. Meijer et al. (1995) and Busing et 
al. (1994) discuss parametric and nonparametric bootstrapping for linear 
two-level models. There are two types of nonparametric bootstrapping, one based 
on resampling cases and the other on resampling residuals or errors. Neither 
is straightforward for multilevel data. For the ‘cases bootstrap’, it is not 
entirely clear whether to resample clusters and units within clusters, only 
clusters or only units within clusters. For the ‘error bootstrap’ it is not 
clear how to estimate the higher-level residuals because of shrinkage (see also 
Carpenter et al., 1999). Patterson et al. (2002) use the jackknife for a latent 
class model with sample weights and Busing et al. (1994) for two-level linear 
models. 


8.3.4 Likelihood ratio, Wald and score tests 

Let $\mathcal{M}_1$ and $\mathcal{M}_2$ denote two contending models with 
$v_1$ and $v_2$ fundamental parameters $\vartheta_{\mathcal{M}_1}$ and 
$\vartheta_{\mathcal{M}_2}$, respectively. Assume that $\mathcal{M}_2$ is nested 
in $\mathcal{M}_1$, in the sense that restrictions are imposed on the structural 
parameters of $\mathcal{M}_1$ to yield a model $\mathcal{M}_2$ with 
$v_1 - v_2$ fewer fundamental parameters. Let the maximized log-likelihoods for 
the two models be denoted $\ell(\hat{\vartheta}_{\mathcal{M}_1} \mid y, X)$ and 
$\ell(\hat{\vartheta}_{\mathcal{M}_2} \mid y, X)$. 

Conventional likelihood-ratio testing can then be performed using the statistic 

    $2\left[\ell(\hat{\vartheta}_{\mathcal{M}_1} \mid y, X) - \ell(\hat{\vartheta}_{\mathcal{M}_2} \mid y, X)\right]$,    (8.7) 

which under regularity conditions is asymptotically $\chi^2$-distributed with 
$v_1 - v_2$ degrees of freedom under the restricted model $\mathcal{M}_2$ 
(e.g. Cox and Hinkley, 1974). 
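
A likelihood ratio test of this kind needs only the two maximized log-likelihoods and the parameter counts. The sketch below uses the standard library only; the helper names and the numerical inputs are hypothetical. The chi-squared tail probability is built from the closed forms for one and two degrees of freedom and the usual recursion for the regularized incomplete gamma function.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:                        # odd df: start from df = 1
        sf, k = math.erfc(math.sqrt(half)), 1
    else:                                  # even df: start from df = 2
        sf, k = math.exp(-half), 2
    while k < df:                          # Q(a + 1, x) = Q(a, x) + x^a e^{-x} / Gamma(a + 1)
        sf += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return sf


def lr_test(loglik_full, loglik_restricted, v_full, v_restricted):
    """Return (statistic, degrees of freedom, asymptotic p-value)."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    df = v_full - v_restricted
    return stat, df, chi2_sf(stat, df)


# Hypothetical maximized log-likelihoods for a 7- and a 5-parameter model
stat, df, p_value = lr_test(-1023.4, -1027.9, 7, 5)
```

The p-value is valid only under the regularity conditions mentioned above; it should not be used as is when the restriction places a parameter on the boundary of the parameter space.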

Wald tests can be derived from 
$\mathrm{Cov}(\hat{\vartheta}_{\mathcal{M}_1})$ and Lagrange multiplier or score 
tests from $\mathrm{Cov}(\hat{\vartheta}_{\mathcal{M}_2})$. These test 
statistics, which only necessitate the estimation of one model, 
$\mathcal{M}_1$ or $\mathcal{M}_2$ respectively, can be regarded as quadratic 
approximations to the likelihood ratio test statistic. It is well known that Wald 
and Lagrange multiplier tests are asymptotically equivalent to the likelihood 
ratio test (e.g. Cox and Hinkley, 1974; Buse, 1982; Engle, 1984). 

However, in the finite sample situation the choice of test statistic can be 
important. The Wald test performs poorly if the log-likelihood is not well 
approximated by a quadratic function in the neighborhood of the parameter 
estimates. Hauck and Donner (1977) show how this can happen in logistic 
regression. Note that, unlike the likelihood ratio test, the Wald test is not 
invariant to nonlinear transformations of the parameter, some transformations 
being preferable to others. If the Wald and likelihood ratio tests yield different 
results, the likelihood ratio test is preferable. 
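
The lack of invariance of the Wald test can be made concrete with a one-parameter example. The sketch below is an illustration under assumed values: i.i.d. exponential data with rate λ, summarized by the sample size and sample mean. It tests the same null hypothesis on the rate scale and on the log-rate scale; the likelihood ratio statistic is the same under either parameterization, whereas the two Wald statistics differ.

```python
import math


def exponential_tests(n, ybar, lam0):
    """LR and two Wald statistics for H0: lambda = lam0 with exponential data."""
    lam_hat = 1.0 / ybar                              # ML estimate of the rate

    def loglik(lam):
        return n * math.log(lam) - lam * n * ybar

    lr = 2.0 * (loglik(lam_hat) - loglik(lam0))

    se_rate = lam_hat / math.sqrt(n)                  # observed information is n / lambda^2
    wald_rate = ((lam_hat - lam0) / se_rate) ** 2

    se_log = 1.0 / math.sqrt(n)                       # delta method for log(lambda)
    wald_log = ((math.log(lam_hat) - math.log(lam0)) / se_log) ** 2
    return lr, wald_rate, wald_log


lr, wald_rate, wald_log = exponential_tests(n=20, ybar=2.0, lam0=1.0)
```

With these inputs the Wald statistic is 20 on the rate scale but about 9.6 on the log scale, with the likelihood ratio statistic (about 12.3) in between.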

The score test can be justified using a central limit theorem argument, 
not just as an approximation to the likelihood ratio test (e.g. Pawitan, 2001, 
p.235). In some situations the score test performs better than the likelihood 
ratio test. An advantage of the score test over the Wald test is that it is 
invariant to nonlinear transformations of the parameters. We refer to Pawitan 
(2001) for an excellent and accessible account of likelihood theory. 

Unfortunately, standard asymptotic results for the likelihood ratio, Wald 
and score test statistics do not hold if the null hypothesis is on the boundary 
of the parameter space, which would violate regularity conditions. A common 
example is testing of the null hypothesis that one or more variance components 
are zero. Consider two two-level models $\mathcal{M}_1$ and $\mathcal{M}_2$ 
where $\mathcal{M}_1$ contains one extra variance $\psi_{kk}$ and $M$ extra 
covariances $\psi_{kj}$, $j = 1, \ldots, M$, $j \neq k$. Since $\psi_{kk}$ is 
nonnegative, the null value $\psi_{kk} = 0$ lies on the boundary of the 
parameter space of $\mathcal{M}_1$. The correct asymptotic distribution of the 
likelihood ratio statistic is a 50:50 mixture of a point mass at zero and a 
$\chi^2$ distribution with $M + 1$ degrees of freedom (Moran, 1971; Miller, 
1977; Self and Liang, 1987; Berkhof and Snijders, 2001). Hence, the 
asymptotically correct p-value is found by computing the p-value that 
corresponds to the $\chi^2$ distribution with $M + 1$ degrees of freedom and 
dividing it by two (Berkhof and Snijders, 2001). Verbeke and Molenberghs (2003) 
derive general one-sided score tests for variance components in models with 
several random effects. For relatively nontechnical discussions of these issues 
we refer to Snijders and Bosker (1999) and Verbeke and Molenberghs (2000). 
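
For the simplest case of a single variance component with no extra covariances (M = 0), the halving rule can be sketched as follows. The function name and the log-likelihood values are hypothetical; the sketch deliberately covers only this special case.

```python
import math


def variance_component_pvalue(loglik_with, loglik_without):
    """Naive and boundary-corrected p-values for H0: one variance component is zero."""
    stat = max(0.0, 2.0 * (loglik_with - loglik_without))
    naive = math.erfc(math.sqrt(stat / 2.0))    # chi-squared(1) upper tail probability
    if stat == 0.0:
        return stat, naive, 1.0                 # at the point mass nothing is less extreme
    return stat, naive, 0.5 * naive             # 50:50 mixture: halve the naive p-value


# Hypothetical maximized log-likelihoods with and without the random intercept
stat, p_naive, p_corrected = variance_component_pvalue(-500.0, -501.92073)
```

Here the naive p-value is about 0.05 and the corrected one about 0.025, so ignoring the boundary makes the test conservative.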

Another violation of regularity conditions occurs in latent class models 
where a $K-1$ class model cannot be obtained by imposing a simple restriction 
on the $K$ class model. For instance, fixing the probability of one class to 
zero renders the corresponding location nonidentified. Alternatively, setting 
the locations of two classes equal implies that only the sum of the 
corresponding probabilities becomes identified. Therefore, likelihood ratio 
statistics do not have a chi-squared distribution in this setting (e.g. Aitkin 
and Rubin, 1985; Titterington et al., 1985; McLachlan and Basford, 1988; 
Everitt, 1988). A possible solution is parametric bootstrapping where data are 
simulated from the $K-1$ class model followed by estimation of both the $K-1$ 
class and $K$ class model to compute the likelihood ratio statistic. The 
empirical distribution of these statistics over bootstrap replications is then 
used to obtain approximate significance probabilities. We refer to Böhning 
(2000, Chapter 4) for a detailed discussion and simulations. 

8.3.5 Confidence intervals 

Confidence intervals can be constructed by inverting the likelihood ratio, Wald 
or score (Lagrange multiplier) tests. To construct a $100(1-\alpha)\%$ 
confidence interval for a parameter $\beta$, we need to find a lower confidence 
limit $\beta_l$ so that the one-sided test of the null hypothesis 
$\beta = \beta_l$ with alternative hypothesis $\beta > \beta_l$ has a p-value 
equal to $\alpha/2$. The upper confidence limit is obtained analogously. 

The Wald-based confidence interval simply becomes 

    $\hat{\beta} \pm z_{1-\alpha/2}\, \mathrm{SE}(\hat{\beta})$, 

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ fractile of the standard normal 
distribution and $\mathrm{SE}(\hat{\beta})$ is the estimated standard error of 
$\hat{\beta}$. 

However, for score and likelihood based intervals, a search is required to 
find the confidence limits. For the likelihood based interval, this requires 
evaluating the log-likelihood at the maximum likelihood estimates of the other 
parameters for each fixed value of the parameter $\beta$ of interest, giving the 
profile log-likelihood. The confidence limits are those values of $\beta$ where 
the profile log-likelihood is $\chi^2_{1-\alpha}(1)/2$ lower than the 
log-likelihood maximized with respect to all parameters, where 
$\chi^2_{1-\alpha}(1)$ is the $1 - \alpha$ fractile of the chi-squared 
distribution with one degree of freedom (3.84 for a 95% confidence interval, so 
that the log-likelihood drops by 1.92 and the deviance increases by 3.84). 
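
The search for profile likelihood limits can be sketched for a case where the profile is available in closed form: the mean of a normal sample with unknown variance. The data and function names are made up for illustration. The limits are the values of the mean at which twice the drop in profile log-likelihood equals 3.84, located by bisection.

```python
import math


def profile_loglik(mu, y):
    """Log-likelihood of N(mu, sigma^2) maximized over sigma^2 for a fixed mu."""
    n = len(y)
    sigma2 = sum((v - mu) ** 2 for v in y) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)


def profile_ci(y, crit=3.841458820694124, tol=1e-10):
    """95% profile likelihood interval for the mean (crit is the chi-squared(1) fractile)."""
    n = len(y)
    mu_hat = sum(y) / n
    l_max = profile_loglik(mu_hat, y)

    def drop(mu):                          # negative inside the interval, zero at a limit
        return 2.0 * (l_max - profile_loglik(mu, y)) - crit

    def solve(inside, outside):            # bisection between an inner and an outer point
        while abs(outside - inside) > tol:
            mid = 0.5 * (inside + outside)
            if drop(mid) < 0:
                inside = mid
            else:
                outside = mid
        return 0.5 * (inside + outside)

    spread = max(y) - min(y)
    step = spread if spread > 0 else 1.0
    lo, hi = mu_hat - step, mu_hat + step
    while drop(lo) < 0:                    # widen until the bracket leaves the interval
        lo -= step
    while drop(hi) < 0:
        hi += step
    return solve(mu_hat, lo), solve(mu_hat, hi)


y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.4]
lower, upper = profile_ci(y)
```

In models with latent variables each evaluation of the profile log-likelihood requires refitting the model with the parameter of interest held fixed, which is why such intervals are far more expensive than Wald intervals.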

In the application chapters we present estimated standard errors for all 
parameter estimates, allowing Wald-based confidence intervals to be constructed 
if desired. The profile log-likelihood method (based on the deviance) is used 
in Section 9.7 for deriving a confidence interval for the population size of 
snowshoe hares. In Section 9.4 we use profile log-likelihood based confidence 
intervals for the guessing parameter in a three-parameter logistic item response 
model. 

Construction of confidence intervals for variance components is 
problematic, particularly if the estimates are close to zero. Bottai (2003) 
recommends that confidence regions should be based on a score test using the 
expected information. 

8.4 Model selection: Relative fit criteria 

Competing models are usually compared using relative fit criteria (e.g. 
Jöreskog, 1974; Tanaka et al., 1990; Tanaka, 1993). 

As pointed out in Section 5.3, equivalent models cannot be distinguished 
empirically although they may represent different or even contradictory 
substantive processes. The same applies to nonequivalent models which happen 
to yield similar fit for the particular data set being analyzed. For these cases 
model selection must proceed by substantive or other nonstatistical 
arguments. 


8.4.1 Significance testing for nested models 

Likelihood ratio, Wald and score tests can be used to compare nested models. 
This approach is, however, not well suited to model selection for at least five 
reasons. 

One major problem is that we have seldom decided which models to 
compare a priori. The tests are, on the contrary, suggested by the same data 
which are to be employed for model assessment. In other words, we are in a 
‘model generating’ situation. Clearly, this situation does not fit the 
conventional test paradigm and the sampling properties of the overall model 
selection strategy are unknown (e.g. Freedman, 1983). A second objection to the 
conventional strategy is that conditioning on a single selected model ignores 
model uncertainty and leads to underestimation of standard errors (e.g. Miller, 
1984). A third problem with the conventional testing approach is that the power 
of hypothesis tests depends on sample size. While acknowledging that more 
observations imply more information, it nevertheless appears unreasonable to 
base model selection on the test criterion. This point can be made clear by 
considering a situation with a very large number of observations. Here, we 
expect all but extremely complicated models to be rejected and we are left with 
‘models’ which merely mirror the particular data set at hand. On the other 
hand, if few observations are available, we expect that oversimplifications tend 
to be retained. Fourth, it should be recognized that models of interest may 
be nonnested. Hence, in this case, investigation of fit cannot be based on the 
traditional test criterion. However, tests for nonnested models such as those 
suggested by Cox (1961, 1962) and related tests surveyed by Davidson and 
MacKinnon (1993, Chapter 11) can be used. Fifth, it has been argued that 
significance probabilities and evidence are often in conflict, even for the 
unrealistic case of solely two nested models (Berger and Sellke, 1987; Berger 
and Delampady, 1987). 

However, it should be noted that significance probabilities need not be 
interpreted in a strict sense, but merely as representing less formal indices of 
fit (e.g. Jöreskog, 1969, 1978). 

8.4.2 Bayesian model selection 

Posterior odds, Bayes factors and the Bayesian information criterion (BIC) 

Suppose that we want to use the data $D$ to compare a set of possibly 
nonnested competing models. A Bayesian approach is to compare the posterior 
probabilities of the models given the data. 

By Bayes theorem, the required posterior probability for a given model 
$\mathcal{M}_k$ is 

    $\Pr(\mathcal{M}_k \mid D) = \Pr(D \mid \mathcal{M}_k) \Pr(\mathcal{M}_k) / \Pr(D)$, 

where $\Pr(\mathcal{M}_k)$ is the prior probability of the model, 
$\Pr(D \mid \mathcal{M}_k)$ the marginal probability of the data given the model 
and $\Pr(D)$ the marginal probability of the data. The extent to which the data 
support $\mathcal{M}_k$ over a competing model $\mathcal{M}_l$ can be assessed 
by the posterior odds of $\mathcal{M}_k$ against $\mathcal{M}_l$, 

    $\frac{\Pr(\mathcal{M}_k \mid D)}{\Pr(\mathcal{M}_l \mid D)} = \left[\frac{\Pr(\mathcal{M}_k)}{\Pr(\mathcal{M}_l)}\right] \left[\frac{\Pr(D \mid \mathcal{M}_k)}{\Pr(D \mid \mathcal{M}_l)}\right]$, 

where the first term in square brackets is the prior odds and the second term is 
the Bayes factor. Often, we have little reason to favor one model over another 
a priori and therefore assign equal prior probabilities to the models so that 
the posterior odds reduce to the Bayes factor. 

Note that many Bayesians would not select a single model but rather base 
inference regarding the parameter(s) of interest on the posterior distribution 
of the parameter(s), averaging over models. Such model averaging is often 
accomplished by using approximate posterior probabilities of the models given 
the data as weights (see e.g. Wasserman, 2000). 

The marginal probability of the data given the model, often called the 
integrated likelihood, is given by 

    $\Pr(D \mid \mathcal{M}_k) = \int \Pr(D \mid \vartheta_k, \mathcal{M}_k) \Pr(\vartheta_k \mid \mathcal{M}_k)\, d\vartheta_k$,    (8.8) 

where $\vartheta_k$ are the parameters for model $\mathcal{M}_k$. It has been 
pointed out that the integrated likelihoods (and therefore the Bayes factor) 
depend heavily on the prior distributions, even if the priors are vague (e.g. 
Kass and Raftery, 1995). 

The integral can rarely be evaluated analytically and various 
approximations have therefore been suggested. The simplest and most commonly 
used approximation for twice the logarithm of the Bayes factor is the Bayesian 
Information Criterion (BIC) (e.g. Schwarz, 1978), 

    $\mathrm{BIC} = 2\left[\ell(\hat{\vartheta}_{\mathcal{M}_k} \mid y, X) - \ell(\hat{\vartheta}_{\mathcal{M}_l} \mid y, X)\right] - (v_k - v_l) \ln N$.    (8.9) 

The BIC can be derived using the Laplace approximation introduced in 
Section 6.3.1 for the integral in (8.8). A further approximation is to replace 
the posterior mode of the parameters by the maximum likelihood estimates and 
the Hessian of the log of the integrand by the Hessian of the log-likelihood, 
i.e. the likelihood is assumed to dominate the prior distribution. It can be 
shown that the BIC is a good approximation to the Bayes factor if a ‘unit 
information prior’ is used, a multivariate normal with covariance matrix equal 
to the inverse of $N^{-1}$ times the Fisher information (e.g. Kass and 
Wasserman, 1995). For relatively nontechnical derivations of the BIC, see Kass 
and Raftery (1995) and Raftery (1995). 

Although only a crude approximation, the BIC is popular among some 
Bayesians as well as frequentists because it is easily obtained from standard 
output of statistical software. The BIC for a given model $\mathcal{M}_k$, here 
denoted $\mathrm{BIC}_k$, is usually defined as 

    $\mathrm{BIC}_k = -2\ell(\hat{\vartheta}_{\mathcal{M}_k} \mid y, X) + v_k \ln N$,    (8.10) 

and the model with the lowest $\mathrm{BIC}_k$ is selected. Frequentists find 
the BIC attractive since it handles nonnested models in contrast to the 
likelihood ratio criterion and because it does not require specification of a 
prior distribution for the model parameters. 

Unfortunately, BIC is difficult to apply to models with latent variables 
because it is not clear what ‘$N$’ should be in the second term of (8.10). For 
instance, in two-level models, including factor, latent class, and structural 
equation models where the level-1 units are items, either the number of 
clusters $J$ or the total number of level-1 units $N$ have been used. In latent 
class modeling, the predominant approach is to use $J$ (e.g. Vermunt and 
Magidson, 2000; Clogg, 1995; McCulloch et al., 2002), but the latent class 
program GLIMMIX (Wedel, 2002b) uses $N$. In structural equation modeling, Bollen 
(1989) and Raftery (1993) use $N$, but Raftery (1995) recommends using $J$. 
The BIC is little used in multilevel regression modeling. Hoijtink (2001) uses 
the Bayes factor for latent class models, avoiding the BIC approximation and 
the choice between $J$ and $N$. 
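
The practical consequence of this ambiguity is that the selected model can depend on which sample size enters the penalty. The sketch below applies the definition in (8.10) to two hypothetical fits; the log-likelihoods, parameter counts, J and N are invented for illustration.

```python
import math


def bic(loglik, n_params, n):
    """BIC_k = -2 * loglik + n_params * ln(n); the model with the lowest value is selected."""
    return -2.0 * loglik + n_params * math.log(n)


# Hypothetical fits: (maximized log-likelihood, number of parameters)
fits = {"2-class": (-4204.0, 9), "3-class": (-4185.0, 14)}
J, N = 500, 5000          # number of clusters and total number of level-1 units

best_by_J = min(fits, key=lambda m: bic(*fits[m], J))
best_by_N = min(fits, key=lambda m: bic(*fits[m], N))
```

With these numbers the three-class model is selected when the penalty uses J, but the two-class model is selected when it uses N, because the five extra parameters cost 5 ln N instead of 5 ln J.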

Another issue with latent variable models is determining the degrees of 
freedom (effective number of parameters). Hodges and Sargent (2001), Burnham 
and Anderson (2002) and Vaida and Blanchard (2004) argue that the degrees 
of freedom lie somewhere between the number of model parameters (in a 
frequentist sense, excluding the latent variables) and the sum of the number of 
model parameters and the number of realizations of the latent variables. 

The deviance information criterion (DIC) 

Spiegelhalter et al. (2002) base a measure of model complexity on the concept 
of the ‘excess of the true over the estimated residual information’, defined as 

    $d\{\vartheta, \hat{\vartheta}, y\} = -2\ell(\vartheta \mid y, X) + 2\ell(\hat{\vartheta} \mid y, X)$, 

where $\vartheta$ is the true parameter vector and $\hat{\vartheta}$ is the 
estimated parameter vector. This can be thought of as the degree of overfitting 
since it represents how much less the data deviate from the model with estimated 
parameters than they do from the model with true parameters. Spiegelhalter et 
al. (2002) propose using the posterior expectation of this measure, 

    $p_D = \mathrm{E}_{\vartheta}(-2\ell(\vartheta \mid y, X) \mid y) + 2\ell(\hat{\vartheta} \mid y, X)$, 

as a Bayesian measure of model complexity or ‘effective number of 
parameters’. They also suggest using the posterior expectation of minus twice 
the log-likelihood (the first term above) as a Bayesian measure of fit, and 
adding $p_D$ to it gives the deviance information criterion (DIC), 

    $\mathrm{DIC} = \mathrm{E}_{\vartheta}(-2\ell(\vartheta \mid y, X) \mid y) + p_D = -2\ell(\hat{\vartheta} \mid y, X) + 2p_D$.    (8.11) 
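
Given draws from the posterior distribution, the effective number of parameters and the DIC are simple averages. The sketch below uses a toy model chosen so that the answer is known: normal data with unit variance and a flat prior on the mean, for which the posterior is available exactly. Taking the posterior mean as the parameter estimate is an assumption of this sketch, and all names and values are illustrative.

```python
import numpy as np


def deviance(mu, y):
    """Minus twice the log-likelihood for y_i ~ N(mu, 1)."""
    return np.sum((y - mu) ** 2) + y.size * np.log(2 * np.pi)


rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)

# With a flat prior the posterior of mu is N(ybar, 1/n), so we can draw from it directly
draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(y.size), size=20000)

mean_deviance = np.mean([deviance(m, y) for m in draws])  # posterior expectation of -2 loglik
plug_in = deviance(draws.mean(), y)                       # deviance at the posterior mean
p_d = mean_deviance - plug_in                             # effective number of parameters
dic = mean_deviance + p_d
```

The model has a single free parameter and `p_d` comes out close to one, as it should. In a hierarchical model the same calculation gives different answers depending on whether the deviance conditions on the latent variables or integrates them out.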

In hierarchical Bayesian models, Spiegelhalter et al. (2002) point out that we 
cannot define model complexity without specifying the level of the hierarchy 
that is the focus of the modeling exercise (conditional or marginal focus in 
the terminology of Vaida and Blanchard, 2004). The focus determines how 
the full probability model (see also Section 6.11) 

    $\Pr(D, \zeta, \vartheta_1, \vartheta_2) = \Pr(D \mid \zeta, \vartheta_1) \Pr(\zeta \mid \vartheta_2) \Pr(\vartheta_2) \Pr(\vartheta_1)$ 

is factorized into the likelihood and prior components. Here $D$ are the data, 
$\zeta$ are latent variables, $\vartheta_1$ are ‘fixed’ parameters and 
$\vartheta_2$ are hyperparameters. 

A conditional (cluster-specific) focus corresponds to defining the likelihood 
as conditional on the latent variables and ‘fixed’ parameters, 
$\Pr(D \mid \zeta, \vartheta_1)$, and the prior as marginal with respect to the 
hyperparameters 

    $\Pr(\vartheta_1) \Pr(\zeta) = \Pr(\vartheta_1) \int \Pr(\zeta \mid \vartheta_2) \Pr(\vartheta_2)\, d\vartheta_2$. 

A marginal (population) focus corresponds to defining the likelihood as marginal 
with respect to the latent variables 

    $\Pr(D \mid \vartheta_1, \vartheta_2) = \int \Pr(D \mid \zeta, \vartheta_1) \Pr(\zeta \mid \vartheta_2)\, d\zeta$ 

and the prior as $\Pr(\vartheta_1) \Pr(\vartheta_2)$. The effective number of 
parameters will obviously be larger for the conditional focus than the marginal 
one. 


8.4.3 The Akaike information criterion (AIC) 

The Akaike Information Criterion (AIC) (e.g. Akaike, 1987) or its variants 
(e.g. Bozdogan, 1987) are often used for model selection. 

Let $f(y \mid X; \vartheta)$ denote the distribution of the responses given the 
parameters for a specified model (i.e. the likelihood) and $f^*(y \mid X)$ the 
distribution for the true model. The ‘information lost’ when 
$f(y \mid X; \vartheta)$ is used to approximate $f^*(y \mid X)$ can be defined 
as 

    $I(f, f^*, \vartheta) = \int f^*(y \mid X) \left[\ln f^*(y \mid X) - \ln f(y \mid X; \vartheta)\right] dy$,    (8.12) 

known as the Kullback-Leibler information between the two models. This 
expectation (over the true distribution of $y \mid X$) of the difference in true 
and approximate log-likelihoods is large if data from the true model tend to be 
unlikely under the specified model. The measure is zero if 
$f(y \mid X; \vartheta) = f^*(y \mid X)$ and positive otherwise. 
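
As a numerical illustration of the Kullback-Leibler information, the sketch below evaluates the integral by the trapezoidal rule for two univariate normal densities without covariates. The choice of densities, the integration limits and the function names are assumptions made for the example.

```python
import math


def normal_logpdf(y, mean, sd):
    return -0.5 * ((y - mean) / sd) ** 2 - math.log(sd) - 0.5 * math.log(2 * math.pi)


def kl_information(true, approx, lo=-12.0, hi=12.0, n=20000):
    """Trapezoidal approximation to the integral of f* (ln f* - ln f).

    `true` and `approx` are (mean, sd) pairs; the limits should cover
    essentially all of the mass of the true density.
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        log_f_star = normal_logpdf(y, *true)
        log_f = normal_logpdf(y, *approx)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * math.exp(log_f_star) * (log_f_star - log_f) * h
    return total


kl_wrong = kl_information(true=(0.0, 1.0), approx=(1.0, 2.0))
kl_right = kl_information(true=(0.0, 1.0), approx=(0.0, 1.0))
```

For these two normals the integral has the closed form ln 2 + 2/8 − 1/2 ≈ 0.443, which the numerical value reproduces, and it is zero when the approximating model coincides with the true one.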

It would be natural to plug in parameter estimates $\hat{\vartheta}(y^{\#})$, 
obtained from some data $y^{\#}$, into the Kullback-Leibler information. The 
expectation in (8.12) is then over samples $y$ that are independent of 
$y^{\#}$ but from the same distribution and is therefore sometimes interpreted 
as a cross-validation measure. The expectation of the Kullback-Leibler 
information over repeated samples $y^{\#} \mid X$, 

    $\int f^*(y^{\#} \mid X)\, I(f, f^*, \hat{\vartheta}(y^{\#} \mid X))\, dy^{\#}$, 

(a double-expectation with respect to $y \mid X$ and $y^{\#} \mid X$) forms the 
basis of the Akaike information criterion. The first term of 
$I(f, f^*, \vartheta)$ is constant across models and can therefore be ignored in 
model comparison. Akaike (1973) showed that twice the expectation of the second 
term can be approximated as 

    $-2 \int f^*(y^{\#} \mid X) \left[\int f^*(y \mid X) \ln f(y \mid X; \hat{\vartheta}(y^{\#} \mid X))\, dy\right] dy^{\#} \approx -2\ell(\hat{\vartheta} \mid y, X) + 2v = \mathrm{AIC}$, 

where $\hat{\vartheta}$ are the parameter estimates for the observed data and 
$v$ is the number of model parameters. The term $2v$ serves to correct the bias 
in using the maximized log-likelihood as an estimator of its double expectation. 

Note that AIC is identical to Mallows' $C_p$ for conventional linear regression 
models. The AIC, BIC and DIC can all be viewed as deviances with a penalty 
for model complexity. For the BIC this penalty is greater than for the AIC so 
that more parsimonious models tend to be selected. In latent variable models, 
it is not clear what the number of parameters $v$ should be, for the same reason 
discussed for the BIC in Section 8.4.2, see e.g. Vaida and Blanchard (2004). 
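
The different penalties are easy to compare side by side. In the sketch below the log-likelihoods, parameter counts and sample size are hypothetical; the only facts relied on are the definitions AIC = −2ℓ + 2v and BIC = −2ℓ + v ln N, so that the BIC penalty per parameter exceeds that of the AIC whenever ln N > 2, that is for N of 8 or more.

```python
import math


def aic(loglik, n_params):
    """Deviance plus a penalty of 2 per parameter."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n):
    """Deviance plus a penalty of ln(n) per parameter."""
    return -2.0 * loglik + n_params * math.log(n)


# Hypothetical fits: (maximized log-likelihood, number of parameters)
fits = {"small": (-1520.0, 4), "large": (-1512.0, 9)}
n = 400

choice_aic = min(fits, key=lambda m: aic(*fits[m]))
choice_bic = min(fits, key=lambda m: bic(*fits[m], n))
```

Here the larger model improves the deviance by 16 at the cost of five parameters, which is enough for the AIC (penalty 10) but not for the BIC (penalty of about 30), so the two criteria disagree.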

See Zucchini (2000) and Wasserman (2000) for discussions of the AIC and 
BIC, the latter from a Bayesian perspective. Recently, a focussed information 
criterion (FIC) has been proposed by Claeskens and Hjort (2003) to select the 
‘best’ model for inference regarding a given parameter of interest. 

We use the AIC and BIC to compare nonnested models for overdispersed 
count data in Section 11.2. There we somewhat arbitrarily use the number of 
fixed parameters (in the fixed and random parts of the models) for v and the 
number of clusters for N which in this case equals the number of units. 

8.5 Model adequacy: Global absolute fit criteria 

Misspecifications can occur in one or more of the model components of the general
model framework. For the conditional response model all misspecifications
conceivable in generalized linear models may happen, including omitted variable
problems, inappropriate link functions, inappropriate variance functions
and inappropriate distributional assumptions. In addition, the assumption
of conditional independence may be violated, random regression coefficients
mistakenly specified as fixed and inappropriate constraints imposed on factor
loadings and measurement error variances. The structural equations for
latent variables may be misspecified by omitting relevant observed or latent
covariates, mistakenly specifying the relations as linear and misspecifying the
distribution of the disturbances at the different levels of the multilevel model.

Misspecifications manifest themselves through a lack of fit of the specified 
model to the available dataset. The major challenge is to distinguish between 
lack of fit due to sampling variability, which is not a problem, and lack of fit due 
to using an inappropriate model, which is a problem. Due to the multiplicity 
of potential misspecification problems, it is clear that model diagnostics is a 
daunting task for complex models. 

A natural approach would seem to be to first assess whether there is evidence
for any type of misspecification before proceeding to identify the source.
Two common approaches are global tests for misspecification and ‘fit indices’.
While some of the tests and indices might be sensitive to specific departures 
from assumptions, there are generally several possible sources of discrepancy 
between model and data. 


8.5.1 Tests for misspecification

Tests of absolute fit

Tests of absolute fit presuppose the existence of a benchmark. The benchmark
is typically the saturated model for a given set of variables. When both
responses and covariates are categorical, the saturated model is the unrestricted
multinomial model with expected counts equal to observed counts
for all cells in the full contingency table (e.g. Bock and Lieberman, 1970;
Bock and Aitkin, 1981). For multivariate normal response variables (and no
covariates) the saturated model is a multivariate normal density with unrestricted
means, variances and covariances. Importantly, tests of absolute fit
cannot detect omitted variables because the benchmark is relative to the specific
variables included. Also note that there is no absolute standard available
when more general models are considered.

Any of the relative fit criteria discussed above can be used to compare
the model of interest to the saturated model. The likelihood ratio test is
the most common because it has a known asymptotic distribution under the
null hypothesis that the restricted model is true. Twice the difference in
log-likelihood between a model and the saturated model is called the deviance.
For categorical data an alternative statistic is the Pearson $X^2$. The deviance
is used to assess absolute fit of a latent class model in Section 13.5. Both the
deviance and Pearson $X^2$ are used for item response models in Section 9.4.
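
For a contingency table, both statistics compare observed cell counts with the counts expected under the fitted model. The following Python sketch uses hypothetical counts and assumes multinomial sampling, so that the deviance takes the familiar form 2 Σ o ln(o/e).

```python
import math

def pearson_x2(observed, expected):
    """Pearson X^2: sum over cells of (o - e)^2 / e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def deviance_g2(observed, expected):
    """Deviance against the saturated multinomial model: 2 * sum o * ln(o / e).

    Cells with a zero observed count contribute nothing (limit of o * ln o).
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

# Hypothetical observed counts and model-implied expected counts.
observed = [30, 20, 25, 25]
expected = [25.0, 25.0, 25.0, 25.0]

print(pearson_x2(observed, expected), deviance_g2(observed, expected))
```

The two statistics are close when all expected counts are reasonably large and can diverge markedly in sparse tables, which is one symptom of the asymptotic results breaking down.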

These tests are problematic for sparse contingency tables since asymptotic 
results cannot be relied on. In the context of latent trait and latent class 
models, Glas (1988), Reiser (1996), Reiser and Lin (1999) and Bartholomew
and Leung (2002) suggest tests based on a collection of marginal tables, such 
as tables for all pairs of variables. 

The logic of hypothesis testing is undermined when absolute fit is tested in
log-linear or covariance structure modeling (e.g. Bishop et al., 1975; Fornell,
1983). In covariance structure modeling the null hypothesis corresponds to a
restricted model and the alternative to the empirical covariance matrix. The
important thing to note is that the researcher in this case desires to retain the
null hypothesis rather than reject it in favor of the alternative. Consequently,
the null hypothesis is maintained when it cannot be rejected. It is clear that the
status of null and alternative hypothesis is reversed in this case, compared to
the standard framework for statistical testing. Fornell (1983) points out the
associated problem that models are often retained due to small sample size and
resulting lack of power. Furthermore, and perhaps more surprisingly, weak observed
relationships among variables (small correlations) increase the chances of
retaining any model considered.

The Hausman and White tests 

The Hausman (1978) misspecification test considers two estimators $\hat\beta$ and $\tilde\beta$
which are both consistent if the model is correctly specified but converge to 
different limits when the model is misspecified. 

Consider for example estimation of the fixed regression coefficients in a
random intercept model. Both the usual maximum likelihood estimator $\hat\beta$ for
the random intercept model and the ordinary least squares estimator $\hat\beta_{FE}$
for the fixed intercepts model (see Section 3.6.1) are consistent if the model is
correctly specified. However, if the random intercept correlates with one of the
covariates (see Section 3.2.1 on page 52), the maximum likelihood estimator
of the random intercept model becomes inconsistent whereas the ordinary
least squares estimator remains consistent. Therefore a difference between the
estimates suggests that the random intercept model is misspecified.

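Formalizing this idea, Hausman suggests the following test statistic:

$$w_h = (\hat\beta - \hat\beta_{FE})' \left[\mathrm{Cov}(\hat\beta - \hat\beta_{FE})\right]^{-1} (\hat\beta - \hat\beta_{FE}),$$

where $\mathrm{Cov}(\hat\beta - \hat\beta_{FE})$ is the covariance matrix of the difference if the model
is correctly specified. The test statistic is asymptotically $\chi^2$ distributed with
degrees of freedom equal to the rank of $\mathrm{Cov}(\hat\beta - \hat\beta_{FE})$. Hausman shows that,
asymptotically, because the maximum likelihood estimator is efficient when the
model is correctly specified,

$$\mathrm{Cov}(\hat\beta - \hat\beta_{FE}) = \mathrm{Cov}(\hat\beta_{FE}) - \mathrm{Cov}(\hat\beta),$$

making the test easy to implement since it requires only the estimated covariance
matrices of the two estimators.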
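
As a rough illustration, the statistic can be computed directly from the two sets of estimates and their covariance matrices. The Python sketch below handles only two coefficients and uses hypothetical numbers. It takes the covariance of the difference as the fixed-intercepts covariance minus the maximum likelihood covariance, the ordering under which the matrix is positive semidefinite when the maximum likelihood estimator is efficient.

```python
import math

def hausman_statistic(b_ml, b_fe, cov_ml, cov_fe):
    """Hausman statistic for two coefficients (2x2 case only)."""
    d = [b_ml[0] - b_fe[0], b_ml[1] - b_fe[1]]
    # Covariance of the difference: Cov(FE) - Cov(ML), assuming ML is efficient.
    v = [[cov_fe[i][k] - cov_ml[i][k] for k in range(2)] for i in range(2)]
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    if v[0][0] <= 0 or det <= 0:
        raise ValueError("covariance difference is not positive definite")
    inv = [[v[1][1] / det, -v[0][1] / det],
           [-v[1][0] / det, v[0][0] / det]]
    return sum(d[i] * inv[i][k] * d[k] for i in range(2) for k in range(2))

def chi2_sf_2df(w):
    """Upper-tail probability of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-w / 2.0)

# Hypothetical estimates and covariance matrices for two coefficients.
b_ml = [0.50, 1.20]
b_fe = [0.62, 1.15]
cov_ml = [[0.0040, 0.0005], [0.0005, 0.0090]]
cov_fe = [[0.0065, 0.0008], [0.0008, 0.0130]]

w = hausman_statistic(b_ml, b_fe, cov_ml, cov_fe)
print(w, chi2_sf_2df(w))
```

In finite samples the estimated difference of covariance matrices need not be positive definite, which is why the sketch raises an error rather than returning a negative statistic.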

Although easy to implement and potentially useful, the approach has some
limitations. First, in common with other tests of fit, the test
is sensitive to different kinds of misspecification making it hard to pinpoint
the problem. Second, simulation studies indicate that the power of the test
may be low for typical sample sizes (e.g. Long and Trivedi, 1993). Finally, the
sampling distribution of the test statistic in finite samples may not be well
approximated by a $\chi^2$ distribution.

It is known from maximum likelihood theory that the estimated covariance
matrix of the parameter estimates is given by the sandwich estimator in (8.4).
If the model is correctly specified, the sandwich estimator reduces to the
inverse of the information matrix, $-H^{-1}$. As for the Hausman test, a difference
between the two estimators suggests that the model is misspecified. White’s
(1982) information matrix test therefore compares the two covariance matrices
using the test statistic

$$d'\, C^{-1} d,$$

where $d$ is a vector of differences of a subset of the elements of the estimated
covariance matrices with associated covariance matrix $C$. This test shares
with the Hausman test the problem of being sensitive to many types of
misspecification. In addition the test is difficult to implement since it requires an
estimator of $C$.


8.5.2 Goodness of fit indices


As noted earlier the roles of null and alternative hypotheses are reversed in 
tests of absolute fit, where the researcher is hoping to ‘fail’ to reject the null
hypothesis that the model is true. Curiously, high power then becomes a 
problem. 

One reaction to this problem is to use so-called goodness of fit indices (GFI),
a typical example being the incremental fit index $\Delta$ proposed for linear structural
equation models by Bentler and Bonett (1980)

$$\Delta = \frac{F_b - F_m}{F_b}.$$

Here, $F_m$ and $F_b$ are the values of the fitting function used to estimate the
parameters (see Section 6.2) for the model of interest and a baseline model,
respectively. This index can be interpreted as the proportional reduction in fitting
function between the baseline model and model of interest. In covariance
structure modeling, a common choice of baseline model is a model imposing
independence among the response variables. Note that the squared multiple
correlation coefficient $R^2$ in linear regression can be defined the same way
where the fitting function is the sum of squared residuals and the baseline
model has no covariates.
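
The parallel between the incremental fit index and the squared multiple correlation can be made concrete with a short Python sketch. The function names and the numerical values are illustrative assumptions only.

```python
def incremental_fit_index(f_baseline: float, f_model: float) -> float:
    """Proportional reduction in the fitting function relative to the baseline."""
    return (f_baseline - f_model) / f_baseline

def r_squared(y, fitted):
    """R^2 written as an incremental fit index.

    The fitting function is the residual sum of squares and the baseline
    model contains only an intercept (the mean of y).
    """
    mean_y = sum(y) / len(y)
    ss_baseline = sum((yi - mean_y) ** 2 for yi in y)
    ss_model = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return incremental_fit_index(ss_baseline, ss_model)

# Hypothetical fitting-function values for a baseline model and a model of interest.
print(incremental_fit_index(2.0, 0.1))
# Hypothetical responses and fitted values from a linear regression.
print(r_squared([1.0, 2.0, 4.0, 5.0], [1.2, 2.1, 3.6, 5.1]))
```

Written this way, it is apparent that the index depends as much on how poorly the baseline fits as on how well the model of interest fits.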

An attraction of the GFIs is that they are generally normed between 0
and 1, with values in the 0.90s typically said to indicate ‘good fit’. A major
problem of this approach is that model choice becomes somewhat arbitrary. This
situation is not helped by the large number of fit indices that have been proposed
(see e.g. Bollen (1989), Marsh et al. (1988) and Mulaik et al. (1989) for surveys)
and are produced by standard software.

The choice of baseline or null model is important (Sobel and Bohrnstedt,
1985). It could be argued that it does not make sense to define goodness of
fit of a given model relative to a baseline model known to be inadequate. For
instance in measurement and longitudinal models a baseline model specifying
independent responses would a priori be expected to be very wrong. It
therefore comes as no surprise that models generally obtain high GFIs in this
case.

Furthermore, how badly the baseline model fits the data depends greatly
on the magnitude of the parameters of the true model. For instance, consider
estimating a simple parallel measurement model. If the true model is a
congeneric measurement model (with considerable variation in factor loadings
and measurement error variances between items), the fit index could be high
simply because the null model fits very poorly, i.e. because the reliabilities of
the items are high. However, if the true model is a parallel measurement model
with low reliabilities the fit index could be low although we are estimating the
correct model. Similarly, estimating a simple linear regression model can yield
a high $R^2$ if the relationship is actually quadratic with a considerable linear
trend and a low $R^2$ when the model is true but with a small slope (relative to
the overall variance).

Goldberger (1991, p.177) puts it the following way:

“Nothing in the CR (Classical Regression) model requires that $R^2$ be high. Hence,
a high $R^2$ is not evidence in favor of the model, and a low $R^2$ is not evidence
against it.”

Perhaps the fit indices should therefore be better described as ‘coefficients
of determination’, a description often used for the $R^2$.


8.5.3 Error of approximation 

In the context of covariance structure modeling Cudeck and Henly (1991) consider
discrepancies among four different covariance matrices. When estimating
a model the fitting function $\hat F = F(\hat\Sigma, S)$ compares the $n \times n$ sample covariance
matrix $S$ with the estimated model-implied covariance matrix $\hat\Sigma$ based on
$v$ parameters. If the model were estimated in the population, the analogous
matrices would be the true covariance matrix $\Sigma_0$ and the covariance matrix
implied by the approximating model $\tilde\Sigma$. The discrepancy due to approximation,
defined as $F_0 = F(\tilde\Sigma, \Sigma_0)$, is unknown since the population matrices are
unknown.

It can be shown that the sample discrepancy function $\hat F$ is a biased estimator
of the discrepancy due to approximation $F_0$. A less biased estimator is $\hat F_0 =
\hat F - J^{-1} d$ (see McDonald, 1989), where $d$ is the degrees of freedom ($d =
n(n+1)/2 - v$). If $\hat F_0$ is negative, Browne and Cudeck (1993) suggest setting it
to zero.

To penalize for model complexity, Steiger (1990) proposes the root mean
square error of approximation (RMSEA), estimated as

$$\mathrm{RMSEA} = \sqrt{\frac{\hat F_0}{d}}.$$

Browne and Cudeck (1993, p.144) state: “We are of the opinion that a value 
of about 0.08 or less for the RMSEA would indicate a reasonable error of 
approximation and would not want to employ a model with a RMSEA greater 
than 0.1.” They also show how confidence intervals and tests for the RMSEA 
can be constructed. We refer to Browne and Arminger (1995) for further 
discussion. 
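
The steps leading to the RMSEA can be sketched in Python as follows. The inputs are hypothetical, and the sketch assumes the usual point estimate: the less biased discrepancy estimate (the sample discrepancy minus d/J, truncated at zero) divided by the degrees of freedom, under a square root.

```python
import math

def rmsea(f_hat: float, d: int, j: int) -> float:
    """RMSEA from the sample discrepancy f_hat, degrees of freedom d and
    sample-size divisor j (called J in the text)."""
    f0_hat = max(f_hat - d / j, 0.0)  # set negative estimates to zero
    return math.sqrt(f0_hat / d)

# Hypothetical minimized fitting function, degrees of freedom and J.
print(rmsea(0.25, 24, 300))
# A model whose sample discrepancy is below d / J gets an RMSEA of zero.
print(rmsea(0.05, 24, 300))
```

Dividing by d is what penalizes complexity: for a given estimated discrepancy, a model with fewer free parameters (larger d) obtains a smaller RMSEA.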

8.5.4 Cross-validation 

Validating a model on the same data for which the model was built using
for instance goodness of fit indices leads to overoptimistic assessments. The
same problem applies to model selection using relative fit criteria. A remedy
for these problems is the use of cross-validation, which can be implemented in
several ways.

An obvious approach is to split the sample randomly into a calibration
sample used to estimate candidate models and a confirmation sample for testing
the models (see Section 10.3.3, page 334 for an example). However, this
approach is wasteful since a major portion of the data is discarded both for
calibration and validation.

Another approach is to repeatedly estimate the model leaving out one unit
at a time. The estimates produced when unit $i$ is omitted are used to obtain
the contribution to the discrepancy measure for that unit (e.g. Stone, 1974;
Geisser, 1975). Obviously this method can be used only if the discrepancy
measure is a function of the contributions from the individual units. For instance,
it does not work for discrepancy measures based on covariance matrices
commonly used in structural equation modeling (see Section 6.2).
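
A minimal Python sketch of leave-one-out cross-validation is given below for a deliberately simple case: a straight-line regression with squared prediction error as the unit-level discrepancy. The data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def loo_cv(xs, ys):
    """Mean squared prediction error, each unit predicted from the others."""
    total = 0.0
    for i in range(len(xs)):
        # Re-estimate the model with unit i omitted.
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Contribution of unit i to the discrepancy measure.
        total += (ys[i] - (intercept + slope * xs[i])) ** 2
    return total / len(xs)

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 2.3, 2.8, 4.5, 4.9, 6.4, 6.8, 8.3]
print(loo_cv(xs, ys))
```

The cross-validated error is never smaller than the in-sample mean squared residual for least squares, which is the overoptimism of in-sample assessment in its simplest form.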

A general approach to cross-validation is the use of resampling techniques 
(e.g. Efron and Tibshirani, 1993, Chapter 17) such as bootstrapping. A simple
version is to estimate the parameters in each bootstrap sample and obtain the 
goodness of fit index for the original data. The mean index over the bootstrap 
samples is then used as a measure of cross-validation. 
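
The simple bootstrap version described above might look as follows in Python, here with $R^2$ for a straight-line regression playing the role of the fit index. The data, the number of bootstrap samples and the seed are arbitrary assumptions.

```python
import random

def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r2(xs, ys, intercept, slope):
    """R^2 of a given line evaluated on the data (xs, ys)."""
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot

def bootstrap_cv_r2(xs, ys, n_boot=200, seed=1):
    """Mean R^2 on the original data of lines fitted to bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xb, yb = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(xb)) < 2:
            continue  # slope not identified in this bootstrap sample
        intercept, slope = fit_line(xb, yb)
        values.append(r2(xs, ys, intercept, slope))
    return sum(values) / n_boot

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 2.3, 2.8, 4.5, 4.9, 6.4, 6.8, 8.3]
print(bootstrap_cv_r2(xs, ys))
```

Since the least squares line fitted to the original data maximizes $R^2$ on those data, the bootstrap mean cannot exceed the in-sample value.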

In some situations it is possible to obtain an estimate of the expectation of 
a cross-validation index (over repeated validation and confirmation samples) 
based on data from a single sample. Examples include the adjusted $R^2$ in
multiple regression and the expected cross-validation index (ECVI) suggested 
by Browne and Cudeck (1989) for covariance structure models; see also Browne 
and Cudeck (1993) and Browne (2000). 


8.6 Model diagnostics: Local absolute fit criteria 

After finding an indication that something is wrong with the model, the next
logical step is to diagnose the problem. Model diagnostics are procedures designed
to suggest violations of more or less specific model assumptions.

A first step is usually to derive some statistics reflecting specific model departures
such as residuals. The next step is to ‘inspect’ the statistics and devise
more or less formal criteria for detecting problems. This step covers a wide
range of approaches including graphics, ad hoc application of thresholds and
formal tests based on theoretical distributions or on simulations such as posterior
predictive checks. Finally, we have to decide whether to take action and
if so, which action.

Another approach is to investigate specific forms of misspecification directly 
by elaborating or extending the model. The problem of diagnostics is then 
transformed into a problem of model selection. 

Different types of residuals can be defined according to the kind of model
used and the type of responses. In some cases residuals may be defined as differences
between model implied and ‘observed’ summary statistics (see Section 8.6.1).
More generally, residuals are often defined for individual units.
However, in latent variable models there are ‘units’ at different levels and
residuals can therefore be defined at different levels. In analogy with residuals
in linear regression, natural higher-level residuals are predictions of the latent
disturbances or residuals in the model. However, any discrepancy functions
could in principle be used at any of the levels depending on the type of model
violation investigated. When there are residuals at different levels, it is not obvious
in which order these should be considered. Snijders and Berkhof (2004)
propose an upward approach, starting at level 1, whereas Langford and Lewis
(1998) suggest a downward approach. An argument in favor of the upward
approach in linear multilevel models is that it is in this case possible to define
level-1 residuals that are unconfounded by level-2 residuals but not vice versa
(Hilden-Minton, 1995).


8.6.1 Residuals for summary statistics 

In covariance structure analysis residuals are typically defined as the differences
between the model-implied and empirical covariances or correlations
(e.g. Costner and Schoenberg, 1973), which can suggest how the model might
be elaborated. In a contingency table the obvious residual is based on the difference
between model-implied and observed cell counts, often standardized
(see e.g. Agresti, 2002). Instead of considering the full contingency table, it is
easier to investigate residuals or goodness of fit in all pairwise (marginal) tables.
For instance, in latent variable models such pairwise tables may suggest
that conditional independence is violated for particular pairs of variables (e.g.
Glas, 1988; Vermunt and Magidson, 2000, Appendix).


8.6.2 Higher-level residuals 

The theoretical residuals for the clusters at the different levels $l$ of a multilevel
dataset are the corresponding disturbances $\zeta^{(l)}$. The scoring methods
discussed in Chapter 7 can be used to predict or estimate these residuals,
empirical Bayes being the most common. For linear models, the maximum
likelihood estimator (called OLS in linear mixed models) may be preferable
because, as pointed out by Snijders and Berkhof (2004), it is less dependent
on model assumptions. However, if the assumptions of the level-1 model
have been checked, Waternaux et al. (1989) recommend using empirical Bayes.
Another argument in favor of empirical Bayes is that the approach can also
be used for categorical and discrete responses, where the maximum likelihood
method can be problematic. For instance, in the case of dichotomous
responses, estimates for clusters with all responses equal to 1 or all responses
equal to 0 (precisely the outlier candidates) cannot be obtained.

The appropriate standard error for empirical Bayes predictions in diagnostics
is the unconditional sampling standard deviation since this reflects
the variability in the estimated residuals under repeated sampling from the
model. This ‘diagnostic standard error’ was used by Lange and Ryan (1989),
Goldstein (2003), Langford and Lewis (1998) and Lewis and Langford (2001)
for linear mixed models. In Section 11.3.3 we use the approximate sampling
standard deviation (see equation (7.8) on page 232) of empirical Bayes predictions
for count data to define standardized residuals.

As pointed out in Section 7.3.2, page 233, the predicted residuals are mutually
correlated not just within the same level but also across levels, although
this is commonly not recognized. Furthermore, the predictions depend on the
true values of other latent variables in the same cluster, see Section 7.3.1,
page 229. For these reasons it is difficult to pinpoint the source of any problem.
For example, in a two-level model with a random intercept and slope,
a cluster with a large true slope but moderate true intercept could have a
large predicted intercept due to the correlation between predicted intercept
and true slope. Similarly, a level-2 unit could have a large predicted residual
because the true residual of another level-2 unit in the same level-3 unit is
large.

Instead of attempting to assess the different residuals for a top-level unit
individually, it may therefore be more advisable to use a discrepancy measure
based on all the residuals for the top-level unit. For two-level linear mixed
models, Snijders and Berkhof (2004) define the standardized level-2 residual
as

$$\tilde\zeta_j^{(2)\prime} \left[\mathrm{Cov}_y(\tilde\zeta_j^{(2)}|X, \hat\vartheta)\right]^{-1} \tilde\zeta_j^{(2)},$$

where the covariance matrix is the marginal sampling covariance matrix discussed
in Section 7.3.2, page 231. They show that this residual is identical to
the maximum likelihood (OLS) counterpart for linear mixed models. Treating
the estimated covariance matrix as known, Snijders and Berkhof (2004) point
out that this residual has an approximate chi-squared distribution with $M$
degrees of freedom (where $M$ is the number of latent variables at level 2).

Another possible residual for a top-level cluster in a multilevel model is 
the change in log-likelihood when the cluster is specifically accommodated, 
for instance by including a separate fixed intercept for the cluster (Longford, 
2001). Alternatively, we could use the log-likelihood contribution of a top-level 
cluster, a similar idea to Snijders and Berkhof’s (2004) multivariate residual 
for two-level linear mixed models. 

8.6.3 Assessing latent variable distributions 

In models with normally distributed responses and latent variables, the empirical
distribution of the empirical Bayes predictions of the residuals is often
used to assess normality of the latent variables. Lange and Ryan (1989) use
this idea to produce weighted normal quantile plots of standardized linear
combinations of latent variable predictions in linear mixed models. Note that
empirical Bayes predictions should not be used to assess the normality assumption
in models with nonnormal responses because the sampling distribution of
the empirical Bayes predictions is in this case unknown. Goldstein (2003, p.
100) nevertheless uses this diagnostic in models with dichotomous responses.

Even for linear mixed models this approach is problematic since, to some
extent, what you put in is what you get out. Specifically, if the multivariate
normal prior distribution $h(\zeta)$ of the latent variables ‘dominates’ (i.e. has a
much sharper peak than) the conditional response distribution given the latent
variables $g(y|\zeta, X)$, the posterior distribution of the latent variables $\omega(\zeta|y, X)$
will appear normal regardless of the true distribution $h(\zeta)$, leading to a normal
sampling distribution of the empirical Bayes predictions. This problem was
demonstrated by Verbeke and Lesaffre (1996) using simulations. Although the
true latent variable distribution was a mixture of two well-separated normal
densities, the posterior distribution of the latent variable (wrongly assuming
a normal latent variable distribution) appeared normal due to shrinkage. For
these reasons it will often be difficult to detect departures from normality
using empirical Bayes predictions.

A solution to this problem could be to use maximum likelihood estimates
of the residuals since these only depend on $g(y|\zeta, X)$ and are not affected
by the assumed latent variable distribution. An alternative is to relax the
normality assumption for the latent variables. Verbeke and Lesaffre (1996)
suggest specifying a mixture of normal densities with a known number of
components. We prefer using nonparametric maximum likelihood estimation
(NPMLE) since this semiparametric approach does not require any distributional
assumption for the latent variables. Rabe-Hesketh et al. (2003a) showed
that empirical Bayes predictions based on NPMLE are virtually indistinguishable
from those assuming normality if normality holds, but not as affected by
shrinkage when the true latent variable distribution is skewed. This approach
is used in Section 11.3.3 for longitudinal count data.

Aitkin et al. (2004) suggest a likelihood ratio test to compare the NPMLE
(with $C$ masses) with the conventional model assuming normality. For the conventional
model, $C$-point Gauss-Hermite quadrature is used so that the model
can be viewed as nested within the semiparametric model (with locations and
probabilities constrained equal to the quadrature locations and weights). A
potential problem with this approach is that the quadrature approximation
may be poor if $C$ is small.


8.6.4 Level-1 residuals

For the linear regression model, the standardized residual is defined as

$$\frac{y_i - \hat\mu_i}{\hat\sigma},$$

where $\hat\sigma$ is the estimated residual standard deviation.

The deletion residual for a unit is the residual using parameter estimates
derived from the sample when the unit is omitted. The idea is that an outlier
could lead to an overestimate of the residual standard deviation making the
standardized residual too small. Langford and Lewis (1998) give expressions
for deletion residuals for linear mixed models.

For generalized linear models, the most common residuals are the Pearson,
deviance and Anscombe residuals, shown for the Bernoulli and Poisson
distributions in Table 8.1.


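Table 8.1 Pearson, deviance and Anscombe residuals

Bernoulli
  Pearson    $(y_i - \hat\mu_i) / \sqrt{\hat\mu_i (1 - \hat\mu_i)}$
  Deviance   $\mathrm{sign}(y_i - \hat\mu_i) \sqrt{-2 \ln(1 - \hat\mu_i)}$  if $y_i = 0$
             $\mathrm{sign}(y_i - \hat\mu_i) \sqrt{-2 \ln(\hat\mu_i)}$  if $y_i = 1$
  Anscombe   $[B(y_i, 2/3, 2/3) - B(\hat\mu_i, 2/3, 2/3)] / [\hat\mu_i (1 - \hat\mu_i)]^{1/6}$

Poisson
  Pearson    $(y_i - \hat\mu_i) / \sqrt{\hat\mu_i}$
  Deviance   $\mathrm{sign}(y_i - \hat\mu_i) \sqrt{2 \hat\mu_i}$  if $y_i = 0$
             $\mathrm{sign}(y_i - \hat\mu_i) \sqrt{2 [y_i \ln(y_i / \hat\mu_i) - (y_i - \hat\mu_i)]}$  if $y_i \neq 0$
  Anscombe   $1.5 (y_i^{2/3} - \hat\mu_i^{2/3}) / \hat\mu_i^{1/6}$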
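
For concreteness, the Pearson and deviance residuals for both distributions, and the Anscombe residual for the Poisson, can be coded directly from their standard definitions. The Python sketch below takes the fitted mean as given; the function names are illustrative, and the Bernoulli Anscombe residual is omitted because it requires an incomplete beta function that the standard library does not provide.

```python
import math

def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def pearson_poisson(y, mu):
    return (y - mu) / math.sqrt(mu)

def deviance_poisson(y, mu):
    if y == 0:
        d = 2.0 * mu
    else:
        d = 2.0 * (y * math.log(y / mu) - (y - mu))
    return sign(y - mu) * math.sqrt(max(d, 0.0))  # guard against rounding

def anscombe_poisson(y, mu):
    return 1.5 * (y ** (2 / 3) - mu ** (2 / 3)) / mu ** (1 / 6)

def pearson_bernoulli(y, mu):
    return (y - mu) / math.sqrt(mu * (1.0 - mu))

def deviance_bernoulli(y, mu):
    d = -2.0 * math.log(mu) if y == 1 else -2.0 * math.log(1.0 - mu)
    return sign(y - mu) * math.sqrt(d)

# Hypothetical count of 7 with fitted mean 3.2.
print(pearson_poisson(7, 3.2), deviance_poisson(7, 3.2), anscombe_poisson(7, 3.2))
# Hypothetical binary response of 1 with fitted probability 0.2.
print(pearson_bernoulli(1, 0.2), deviance_bernoulli(1, 0.2))
```

For skewed cases such as the count example, the three residuals can differ noticeably, with the Pearson residual typically the most extreme.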


For models with a latent response formulation such as the probit and logit
models, Albert and Chib (1995, 1997) and Gelman et al. (2000) use ‘latent
data residuals’ $y^* - \mu$ within a fully Bayesian framework. Similarly, in the
frequentist setting ‘generalized residuals’ have been defined as the conditional
expectation of the latent data residual given the observed response $y$ (e.g.
Gourieroux et al., 1987a; Chesher and Irish, 1987) and ‘simulated residuals’ as
draws from the posterior distribution of the latent residual given the observed
response (Gourieroux et al., 1987b).

If the latent variables were known, we could simply substitute their values
into the linear predictor and use $\mu = g^{-1}(\nu)$ in expressions for conventional
residuals to obtain level-1 residuals. However, since the values of the latent
variables are not known, it is not clear how to define the residuals. In linear
mixed models, empirical Bayes predictions for the latent variables are often
substituted (e.g. Langford and Lewis, 1998), yielding the posterior mean of
the residual. For linear factor and structural equation models with latent
variables, Bollen and Arminger (1991) use either the empirical Bayes predictor
(regression method) or the maximum likelihood estimator (Bartlett method)
to define residuals for the items. They also standardize these residuals using
the appropriate sampling standard deviation to identify outlying items for
units.

In models with nonlinear link functions, substituting empirical Bayes predictions
for latent variables does not produce the posterior mean of the residual.
In a Bayesian setting, Dey et al. (1998) and Albert and Chib (1995) use the
posterior distribution of the raw (unstandardized) residual $y - \mu$ given the
observed responses, whereas Albert and Chib also use the posterior distribution
of the latent data residual $y^* - \mu$ given the observed responses. There
is surprisingly little work on level-1 residuals for latent variable models with
nonnormal responses in the frequentist setting.

For linear mixed models Hilden-Minton (1995) suggests estimating a separate
model for each cluster to define level-1 residuals that are not ‘confounded’
with level-2 residuals; see also Snijders and Bosker (1999).


8.6.5 Identifying outliers 

We let an outlier be a unit or cluster which appears to be inconsistent with the 
specified model. This presupposes that most units or clusters appear consistent 
with the model. Since we do not know the true residuals, outlier detection 
must be based on estimated residuals and their sampling distribution. If the 
sampling distribution of the residuals is normal, as in the ‘linear case’ with
normal latent variables and responses, a residual can be defined as an outlier 
if it exceeds a certain normal fractile. For other response-types, the sampling 
distribution is generally unknown and simulations can be used to obtain a 
reference distribution. 

We will use $T_j$ to denote a residual or discrepancy statistic for cluster (or
unit) $j$. A natural approach would be to flag the cluster (or unit) with largest
statistic, $T_{\max}$, as a potential outlier. Testing often proceeds as if the corresponding
cluster $j^*$ had been selected a priori, that is, by comparing $T_{\max}$ with
the sampling distribution of $T_{j^*}$. However, the correct reference distribution,
taking the post-hoc selection into account, is the sampling distribution of the
largest statistic $T_{\max}$. This can easily be accomplished by simulation or parametric
bootstrapping (e.g. Longford, 2001). In each replication $k$, responses
are first simulated from the model, parameters are then estimated and the
statistics $T_j$ computed. The empirical distribution of the largest statistics
$T_{\max}^k$ is then used to obtain a p-value. If we use parameter estimates based
on the original data for simulating responses from the model, we are unrealistically
treating the parameters as known. To take estimation uncertainty
into account, Longford (2001) suggests sampling the parameters from their
estimated sampling distribution (multivariate normal with covariance matrix
from the information matrix).
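
The parametric bootstrap for the largest statistic can be sketched in Python for a toy model in which units are independent normal observations and $T_j$ is the absolute standardized deviation from the mean. The model, data, number of replications and seed are all hypothetical, and the simpler variant is used in which the parameter estimates from the original data are treated as known when simulating.

```python
import random
import statistics

def t_max(values):
    """Largest absolute standardized deviation, with mean and standard
    deviation estimated from the same values."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return max(abs(v - m) / s for v in values)

def t_max_pvalue(values, n_rep=2000, seed=0):
    """Bootstrap p-value for the observed T_max, taking the post-hoc
    selection of the most extreme unit into account."""
    rng = random.Random(seed)
    m = statistics.mean(values)
    s = statistics.stdev(values)
    t_obs = t_max(values)
    exceed = 0
    for _ in range(n_rep):
        # Simulate responses from the fitted model, then re-estimate the
        # parameters and recompute all T_j inside t_max().
        simulated = [rng.gauss(m, s) for _ in values]
        if t_max(simulated) >= t_obs:
            exceed += 1
    return exceed / n_rep

# Hypothetical data without and with a gross outlier in the last position.
clean = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.7, 1.1]
contaminated = clean[:-1] + [6.0]
print(t_max_pvalue(clean), t_max_pvalue(contaminated))
```

Comparing the observed maximum with the distribution of a single preselected $T_j$ instead would overstate the evidence, since some unit is always the most extreme.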

Somewhat ironically, significance testing for diagnostics has recently become
popular among Bayesians (e.g. Gelman et al., 2003; Marshall and Spiegelhalter,
2003). The most common approach is posterior predictive checking (e.g.
Rubin, 1984), where the ‘predictive distribution’ of a discrepancy statistic $T$
is defined as

$$\Pr(T|y^{obs}) = \int \Pr(T(y)|L) \Pr(L|y^{obs})\, dL.$$

Here $\Pr(T(y)|L)$ is the sampling distribution of $T(y)$ for given parameters $L$
and $\Pr(L|y^{obs})$ is the posterior distribution of the parameters, so that $\Pr(T|y^{obs})$
can be loosely interpreted as the sampling distribution of $T$ averaged over the
posterior of the parameters L. Posterior predictive checking is straightforward 
using Markov chain Monte Carlo (MCMC); see Section 6.11.5. For each draw 
of L from its posterior, y is sampled from its conditional distribution given 
L (the normed likelihood). The empirical distribution of T(y) is then the 
required reference distribution. 
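
The two-step simulation can be illustrated with a model simple enough that no MCMC is needed. In the Python sketch below the responses are assumed to be independent Bernoulli with a common probability, the prior is uniform so that the posterior is a beta distribution, and the discrepancy statistic is the number of switches between 0 and 1 in the sequence. All of these choices are illustrative assumptions.

```python
import random

def n_switches(seq):
    """Discrepancy statistic T: number of changes between adjacent responses."""
    return sum(a != b for a, b in zip(seq, seq[1:]))

def ppc_pvalue(y_obs, n_draws=2000, seed=0):
    """Posterior predictive probability that T(y) is at most T(y_obs)."""
    rng = random.Random(seed)
    n, s = len(y_obs), sum(y_obs)
    t_obs = n_switches(y_obs)
    count = 0
    for _ in range(n_draws):
        # Step 1: draw the parameter from its posterior (uniform prior).
        p = rng.betavariate(1 + s, 1 + n - s)
        # Step 2: simulate replicated responses given the drawn parameter.
        y_rep = [1 if rng.random() < p else 0 for _ in range(n)]
        if n_switches(y_rep) <= t_obs:
            count += 1
    return count / n_draws

# Hypothetical sequence with strong serial dependence (a single switch).
y_obs = [0] * 10 + [1] * 10
print(ppc_pvalue(y_obs))
```

A very small value indicates that sequences with so few switches are rarely generated by the independence model, pointing to a violation of the independence assumption rather than of the assumed success probability.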

In latent variable models, or hierarchical Bayesian models, the parameter
vector $L$ includes latent variables, parameters and hyperparameters, $L =
(\vartheta, \zeta)$. In this case posterior predictive checking has been criticized for being
too lenient or conservative (e.g. Dey et al., 1998; Bayarri and Berger, 2000;
Marshall and Spiegelhalter, 2003). This is because the latent variables $\zeta_j$ are
sampled from their posterior distribution given the responses for cluster $j$,
$y_j^{obs}$. New responses simulated for these $\zeta_j$ will then resemble the observed
responses too closely. This problem can be avoided by sampling the latent
variables from their prior distribution to reflect sampling of clusters as well as
sampling of units within clusters. The reference distribution for a discrepancy
statistic $T_j$ for cluster $j$ then becomes

$$\Pr(T_j) = \int \Pr(T_j|\vartheta, \zeta_j) \Pr(\zeta_j|\vartheta) \Pr(\vartheta|y^{obs})\, d\zeta_j\, d\vartheta.$$

Marshall and Spiegelhalter (2003) view this ‘full-data mixed replication’ approach
as a computationally convenient approximation to the ideal method
of cross validation. The reference distribution for cross validation equals the
above, with the difference that it uses the posterior of the parameters based on all
responses excluding those for cluster $j$. These ideas are easily adapted to the
frequentist setting.

8.6.6 Influence 

We have defined outliers as units (or clusters) that appear to be inconsistent 
with the rest of the data. Another type of extreme unit is one with great 
influence on the parameter estimates, in the sense that omitting the unit will 
cause substantial changes. 

The influence of individual top-level clusters on the parameter estimates
can be assessed using Cook’s distance for the jth cluster, defined as

$$C_j = -2\, g_j' H^{-1} g_j,$$

where $g_j$ is the score vector (first derivatives of the log-likelihood
contribution) for cluster j and H is the Hessian of the total log-likelihood.

Another measure of influence is the change in parameter estimates when a
cluster is deleted. Let $\hat\vartheta$ be the parameter estimates using the
full sample and $\hat\vartheta_{(-j)}$ the estimates when cluster j is deleted.
DFBETAS$_{s(-j)}$ for a parameter $\vartheta_s$ is then defined as

$$\mathrm{DFBETAS}_{s(-j)} = \frac{\hat\vartheta_s - \hat\vartheta_{s(-j)}}{\mathrm{SE}(\hat\vartheta_s)}.$$

It can be computationally very heavy to re-estimate the model with each
cluster deleted in turn. In the context of generalized linear models, Pregibon
(1981) suggests using one step of the Newton-Raphson algorithm to obtain an
approximation $\hat\vartheta^1_{(-j)}$ for $\hat\vartheta_{(-j)}$ using
$\hat\vartheta$ as starting values,

$$\hat\vartheta^1_{(-j)} = \hat\vartheta - H_{(-j)}^{-1}\, g_{(-j)}, \qquad (8.13)$$

where $H_{(-j)}$ is the Hessian without cluster j and $g_{(-j)}$ is the
gradient vector without cluster j, given by

$$g_{(-j)} = \sum_{k \neq j} g_k = -g_j, \qquad (8.14)$$

since the total gradient vector is 0 at the maximum likelihood estimates.

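There is a simple relationship between Cook’s distance and DFBETAS obtained
using the one-step approximation in (8.13). To show this, we first
use (8.13) and (8.14) to write the score vector as

$$g_j = -H_{(-j)}\,(\hat\vartheta - \hat\vartheta^1_{(-j)}).$$

Cook’s distance can then be approximated as

$$C_j = -2\,(\hat\vartheta - \hat\vartheta^1_{(-j)})'\, H_{(-j)} H^{-1} H_{(-j)}\,(\hat\vartheta - \hat\vartheta^1_{(-j)})$$

$$\approx 2\,(\hat\vartheta - \hat\vartheta^1_{(-j)})'\,[\mathrm{Cov}(\hat\vartheta)]^{-1}\,(\hat\vartheta - \hat\vartheta^1_{(-j)}), \qquad (8.15)$$

since $H_{(-j)} \approx H$ in large samples (where a given cluster makes a small
contribution to the Hessian) and $H = -[\mathrm{Cov}(\hat\vartheta)]^{-1}$. The
expression on the right-hand side of (8.15) is twice Pregibon’s (1981) one-step
influence diagnostic.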

A typical cut-point for Cook’s distance is four times the number of pa¬ 
rameters divided by the number of observations (clusters in this case). For 
DFBETAS, two divided by the square root of the number of observations is 
often used. 
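
The diagnostics above can be illustrated with a deliberately simple case. The sketch below assumes a straight-line regression with unit residual variance, in which each observation is treated as its own 'cluster'; the score vectors and Hessian then have closed forms, and because the log-likelihood is quadratic the one-step approximation coincides with a full refit. The data and all function names are made up for illustration and are not taken from the text.

```python
import math

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def fit(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns ([b0, b1], X'X)."""
    xtx = [[float(len(xs)), sum(xs)], [sum(xs), sum(x * x for x in xs)]]
    xty = [sum(ys), sum(x * y for x, y in zip(xs, ys))]
    return solve2(xtx, xty), xtx

# Made-up data: six units, the last one with an extreme covariate value.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 6.0]
theta_hat, xtx = fit(xs, ys)
hessian = [[-v for v in row] for row in xtx]   # H = -X'X when the variance is 1
# Standard errors from the diagonal of Cov = (X'X)^{-1}.
se = [math.sqrt(solve2(xtx, [1.0, 0.0])[0]), math.sqrt(solve2(xtx, [0.0, 1.0])[1])]

def score(j):
    """Score vector g_j = residual_j * (1, x_j)'."""
    resid = ys[j] - theta_hat[0] - theta_hat[1] * xs[j]
    return [resid, resid * xs[j]]

def cooks_distance(j):
    """C_j = -2 g_j' H^{-1} g_j."""
    g = score(j)
    h_inv_g = solve2(hessian, g)
    return -2.0 * (g[0] * h_inv_g[0] + g[1] * h_inv_g[1])

def one_step_deleted(j):
    """One Newton-Raphson step as in (8.13), using g_(-j) = -g_j from (8.14)."""
    g, x = score(j), [1.0, xs[j]]
    h_minus_j = [[hessian[r][c] + x[r] * x[c] for c in range(2)] for r in range(2)]
    step = solve2(h_minus_j, [-g[0], -g[1]])
    return [theta_hat[0] - step[0], theta_hat[1] - step[1]]

def dfbetas(j):
    """Standardized change in each estimate when unit j is deleted."""
    deleted = one_step_deleted(j)
    return [(theta_hat[s] - deleted[s]) / se[s] for s in range(2)]

n_units, n_params = len(xs), 2
cook_cut = 4.0 * n_params / n_units      # 4 x parameters / observations
dfbetas_cut = 2.0 / math.sqrt(n_units)   # 2 / sqrt(observations)
for j in range(n_units):
    print(j, round(cooks_distance(j), 3), [round(d, 3) for d in dfbetas(j)])
```

For generalized linear mixed models the scores and Hessian would instead come from the fitted marginal log-likelihood, and the one-step values would only approximate the refitted estimates.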

Cook’s distances were applied to linear mixed models by Lesaffre and Verbeke
(1998) and to generalized linear mixed models by Ouwens et al. (2001)
and Xiang et al. (2002). Ouwens et al. (2001) also developed methods to assess
the influence of level-1 units. We use influence diagnostics in Section 11.3.3 to
identify influential subjects in a longitudinal dataset.

8.7 Summary and further reading

We have discussed some approaches to model specification and diagnostics 
without providing much guidance on how to proceed. One reason for this is 
that there may not be one optimal strategy. Another reason is that there 
has been relatively little research in this area, particularly for latent variable 
models. Finally, bringing together suggestions from disparate literatures has 
been a daunting task. 

Given that statistical modeling pervades much if not most empirical re¬ 
search, it is somewhat surprising that the literature on statistical modeling 
per se appears to be scarce. Two useful papers are Cox (1990) and Lehmann 
(1990). 

Leamer (1978) gives an interesting treatment of ‘specification searches’ from
a Bayesian viewpoint whereas Harrell (2001) provides an extensive treatment 
of model building for ‘empirical models’. Strategies for model building and di¬ 
agnostics in linear mixed models are discussed by Langford and Lewis (1998), 
Snijders and Bosker (1999), Verbeke and Molenberghs (2000, Chapter 9) and 
Snijders and Berkhof (2004). Different strategies for model building in struc¬ 
tural equation models are discussed in Bollen and Long (1993). 

Useful books on statistical inference include Cox and Hinkley (1974), Lind¬ 
sey (1996) and Pawitan (2001). 

Research on diagnostics for latent variable models still appears to be in its 
infancy, especially for models with noncontinuous responses. A recent book on 
diagnostics for linear mixed models (particularly growth curve models) is Pan 
and Fang (2002). For ordinary linear regression models there are several books 
on diagnostics that may also be useful for more general models, including 
Barnett and Lewis (1984), Belsley et al. (1980), Cook and Weisberg (1982) 
and Chatterjee and Hadi (1988). 

It is important to remember that the ‘final model’ may be misspecified,
however carefully diagnosed and checked. When inference is required for par¬ 
ticular parameters such as treatment effects, it may therefore be advisable to 
investigate the sensitivity to model assumptions in some way. Bayesians some¬ 
times use model averaging, yielding credible intervals for the parameters that 
attempt to take model uncertainty into account. A less ambitious approach is 
sensitivity analysis where model assumptions are modified to investigate the 
‘robustness’ of the inferences.

It should also be remembered that standard errors tend to underestimate 
uncertainty because model building is usually performed on the same data 
used to estimate parameters. In an extremely exploratory analysis, standard 
errors are therefore not presented at all, whereas they are generally taken 
at face value in somewhat exploratory analyses. We agree with Cox (1990, 
p.173) that standard errors should instead be interpreted as lower bounds in 
this case:

“Most applications are in any case somewhat in between the confirmatory-
exploratory extremes and some notion, however approximate, of precision
seems highly desirable in the exploratory portions of the analysis, if extremes
of overinterpretation are to be avoided. The attachment of standard errors,
etc. to the main features of an exploratory analysis, e.g. an exploratory
multiple regression, seems often enlightening as indicating a minimum
uncertainty.”

Applications



CHAPTER 9 


Dichotomous responses 


9.1 Introduction 

Since this is the first application chapter, we begin by discussing some of 
the classical models described in Chapter 3; a random intercept model in 
Section 9.2, a latent class model in Section 9.3, item response and MIMIC 
models in Section 9.4 and a random coefficient model in Section 9.5. The 
random coefficient model is used in the somewhat unusual context of a meta¬ 
analysis. A more common application for longitudinal data is described in 
Section 11.3 of the chapter on counts. 

The subsequent sections consider a wide variety of less conventional models. 
In Section 9.6 we model longitudinal data using models incorporating both 
state dependence and unobserved heterogeneity. In Section 9.7 we use capture- 
recapture models with unobserved heterogeneity to estimate population sizes. 
Finally, we consider multilevel item response models in Section 9.8. 

The applications discussed in this chapter come from a wide range of dif¬ 
ferent disciplines, namely education, clinical medicine, epidemiology, biology, 
economics and sociology or social psychology. 

9.2 Respiratory infection in children: A random intercept model 

Sommer et al. (1983) describe a cohort study of Indonesian preschool chil¬ 
dren examined up to six consecutive quarters for the presence of respiratory 
infection. 

Zeger and Karim (1991), Diggle et al. (2002) and others estimate a logistic-
normal random intercept model for a subset of 275 of the children.¹ The model
for the ith quarter and the jth child can be written as

$$\operatorname{logit}[\Pr(y_{ij} = 1 \mid x_{ij}, \zeta_j)] = x_{ij}'\beta + \zeta_j,$$

where

$$\zeta_j \sim N(0, \psi).$$

Zeger and Karim used the following covariates: 

• [Age] age in months (centered around 36)

• [Xero] a dummy variable for presence of xerophthalmia, an ocular
manifestation of chronic vitamin A deficiency

• [Cosine] cosine term of the annual cycle to capture seasonality

• [Sine] sine term of the annual cycle to capture seasonality

• [Female] a dummy variable for female gender

• [Height] height for age as percent of the National Center for Health
Statistics (NCHS) standard (centered at 90%), where lower values indicate
lower nutritional status

• [Stunted] a dummy variable for stunting, defined as being below 85% in
height for age.

¹ The data can be downloaded from gllamm.org/books or Patrick Heagerty’s webpage
http://faculty.washington.edu/heagerty/Books/AnalysisLongitudinal/xerop.data.

Maximum likelihood estimates for the random intercept model using 12-point
adaptive quadrature are given in the left part of Table 9.1. Respiratory
infection is related to [Age], season ([Cosine] and [Sine]) and [Height]. The
odds ratio for [Age] is estimated as 0.967, corresponding to a decrease of
the odds by 3.3% every month. This is the conditional effect of [Age] given
the random intercept $\zeta_j$ (and the other covariates). Estimates of marginal
effects using generalized estimating equations (GEE) are given in the second
set of columns of Table 9.1. Here the structure of the working correlation
matrix was specified as exchangeable (see Display 3.3A on page 82) and the
standard errors are based on the sandwich estimator described in Section 8.3.3.
The marginal and conditional effect estimates are very similar here since the
random effects variance is estimated as only 0.65, so that the attenuation
factor is about 0.90 (see Section 4.8.1).

Table 9.1 Estimates for random intercept logistic model and GEE

Figure 9.1 displays the conditional and population averaged relationships
between age and respiratory infection in the first quarter for girls who are not
stunted, do not have xerophthalmia and whose height equals the average 1.83 at
the first quarter. The three dashed curves show the conditional relationships
for $\zeta_j = -0.8$, 0 and 0.8 (from bottom to top). The circles show the population
averaged relationship as estimated by GEE, whereas the solid curve coinciding
with these circles represents the population averaged curve for the random
intercept model, obtained by integrating over the random intercept,


$$E_{\zeta}\left[\left.\frac{\exp(x_{ij}'\beta + \zeta_j)}{1 + \exp(x_{ij}'\beta + \zeta_j)}\ \right|\ x_{ij}\right].$$


Here the random intercept model and GEE imply nearly the same marginal 
relationship between respiratory infection and [Age], although this need not 
always be so. 
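
The integration over the random intercept can be approximated numerically. The sketch below uses a plain grid sum over a standard normal density rather than the adaptive quadrature used for estimation, and compares the result with a widely used closed-form attenuation approximation for logistic-normal models. That approximation, with constant 16√3/(15π), is an assumption here and may differ in detail from the expression in Section 4.8.1. Function names are hypothetical.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def marginal_probability(linear_predictor, psi, n_points=2001, half_width=8.0):
    """Average expit(lp + zeta) over zeta ~ N(0, psi) using an evenly spaced grid."""
    sd = math.sqrt(psi)
    step = 2.0 * half_width / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        z = -half_width + k * step                      # standardized random intercept
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += expit(linear_predictor + sd * z) * density * step
    return total

def attenuation_factor(psi):
    """Assumed approximation: marginal coefficient = factor * conditional coefficient."""
    c = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)
    return 1.0 / math.sqrt(1.0 + c * c * psi)

psi_hat = 0.65
for lp in (-3.0, -2.0, -1.0, 0.0):
    exact = marginal_probability(lp, psi_hat)
    approx = expit(lp * attenuation_factor(psi_hat))
    print(lp, round(expit(lp), 4), round(exact, 4), round(approx, 4))
```

With a variance of 0.65 the factor is close to the value of about 0.90 quoted in this section, which is why the conditional and population averaged curves differ so little.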



Figure 9.1 Conditional and population averaged effects of [Age]. Circles: population
averaged curve from GEE; solid curve: population averaged curve from random
intercept model; dashed curves: conditional relationships for $\zeta_j = -0.8, 0, 0.8$

The random intercept variance is estimated as $\hat\psi = 0.650$. Using the
latent response formulation of the logit model (see Section 2.4), the
correlation between the latent responses, conditional on covariates, is therefore
$\hat\psi/(\hat\psi + \pi^2/3) = 0.165$. The conditional correlations between the observed
responses depend on the covariate values. In contrast, GEE assumes a constant
Pearson correlation between observed responses (conditional on the covari¬ 
ates), estimated as 0.045. 
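
The two summary figures quoted for this model follow directly from the reported estimates. A short check, using only numbers given in this section (the variable names are illustrative):

```python
import math

psi_hat = 0.650                       # estimated random intercept variance
# Latent-response correlation: psi / (psi + pi^2 / 3), where pi^2 / 3 is the
# variance of the standard logistic distribution.
latent_icc = psi_hat / (psi_hat + math.pi ** 2 / 3.0)

odds_ratio_age = 0.967                # conditional odds ratio for [Age]
percent_decrease = 100.0 * (1.0 - odds_ratio_age)

print(round(latent_icc, 3), round(percent_decrease, 1))
```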

Figure 9.2 shows a boxplot of the empirical Bayes predictions (see Sec¬ 
tion 7.3.1) of the children’s random intercepts (first boxplot). The distribution 
is skewed and there are some extreme values. However, with the important
exception of linear mixed models, the distribution of the empirical Bayes
predictions is generally not normal in generalized linear mixed models. Therefore
it is difficult to judge whether the extreme values are a cause for concern 
(see Section 8.6.2). To assess this informally, we simulated the responses from 
the estimated model (keeping the covariate values from the data), estimated 
the parameters and predicted the random intercepts. We repeated this three 
times, giving the second to fourth boxplots in Figure 9.2. The boxplots for 
the simulated data resemble that for the real data, so there appears to be no 
cause for concern. 



Figure 9.2 Boxplots of empirical Bayes predictions of random intercept for data (first 
boxplot) and for three simulated datasets (boxplots 2 to 4) 

Figure 9.3 shows a graph of the predicted random intercepts versus their
rank, for every fifth rank, with error bars representing ± one posterior
standard deviation.

Figure 9.3 Empirical Bayes predictions of random intercept versus rank with error
bars representing ± one posterior standard deviation

9.3 Diagnosis of myocardial infarction: A latent class model

Rindskopf and Rindskopf (1986) analyze data² from a coronary care unit in
New York City where patients were admitted to rule out myocardial infarction
(MI) or ‘heart attack’.

Each of 94 patients was assessed on four diagnostic criteria:

• [Q-wave] a dummy variable for presence of a Q-wave in the ECG

• [History] a dummy for presence of a classical clinical history

• [LDH] a dummy for having a flipped LDH

• [CPK] a dummy for presence of a CPK-MB

² The data can be downloaded from gllamm.org/books

The data are shown in Table 9.2. Since the patients have either had MI or not,
it seems reasonable to specify two latent classes. Let $\pi_1$ be the probability of
being in the first latent class,

$$\operatorname{logit}(\pi_1) = \varrho_0.$$

If the second latent class corresponds to MI, the prevalence of MI is $\pi_2 = 1 - \pi_1$.
The conditional response probabilities can be specified as

$$\operatorname{logit}[\Pr(y_{ij} = 1 \mid c)] = e_{ic}.$$

The probabilities $\Pr(y_{ij} = 1 \mid c = 2)$ represent the sensitivities of the
diagnostic tests (the probabilities of a correct diagnosis for people with the
illness), whereas $1 - \Pr(y_{ij} = 1 \mid c = 1)$ represent the specificities (the
probabilities of a correct diagnosis for people without the illness). Note that
this model is equivalent to a two-class one-factor model since we can replace
$e_{ic}$ by $\lambda_i e_c$, where $\lambda_1 = 1$.

Table 9.2 Diagnosis of myocardial infarction data

[Q-wave]  [History]   [LDH]     [CPK]     Obs.     Exp.    Prob. of
(i = 1)   (i = 2)    (i = 3)   (i = 4)   count    count    MI (c = 2)
   1         1          1         1        24     21.62      1.000
   0         1          1         1         5      6.63      0.992
   1         0          1         1         4      5.70      1.000
   0         0          1         1         3      1.95      0.889
   1         1          0         1         3      4.50      1.000
   0         1          0         1         5      3.26      0.420
   1         0          0         1         2      1.19      1.000
   0         0          0         1         7      8.16      0.044
   1         1          1         0         0      0.00      0.017
   0         1          1         0         0      0.22      0.000
   1         0          1         0         0      0.00      0.001
   0         0          1         0         1      0.89      0.000
   1         1          0         0         0      0.00      0.000
   0         1          0         0         7      7.78      0.000
   1         0          0         0         0      0.00      0.000
   0         0          0         0        33     32.11      0.000

Source: Rindskopf and Rindskopf (1986)


Table 9.3 Estimates for diagnosis of MI

                      Class 1 (‘No MI’)                Class 2 (‘MI’)
Parameter           Est      (SE)     Prob.          Est      (SE)     Prob.
                                      1-Spec.                          Sens.
e_1c [Q-wave]     -17.58   (953.49)   0.00           1.19    (0.42)    0.77
e_2c [History]     -1.42     (0.39)   0.30           1.33    (0.39)    0.79
e_3c [LDH]         -3.59     (1.01)   0.03           1.57    (0.47)    0.83
e_4c [CPK]         -1.41     (0.41)   0.20          16.86  (706.04)    1.00
                                      1-Prev.                          Prev.
ϱ_0 [Cons]          0.17     (0.22)   0.54            -        -       0.46

Parameter estimates are given in Table 9.3. The estimates $\hat e_{11}$ for [Q-wave]
and $\hat e_{42}$ for [CPK] take on very large negative and positive values
corresponding to conditional response probabilities very close to 0 and 1,
respectively, and therefore represent a so-called boundary solution. The
corresponding standard errors are extremely large. This is because the
likelihood changes very little when these extreme parameters change, as the
predicted probabilities remain close to 0 and 1 for a large range of values.
For example, logits of 5 and 20 correspond to probabilities of 0.993 and 1.000,
respectively.
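
The flatness of the probability scale at extreme logits is easy to verify. A minimal check (the function name is illustrative):

```python
import math

def inv_logit(x):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

for logit in (5.0, 10.0, 15.0, 20.0):
    print(logit, round(inv_logit(logit), 6))

# Moving the logit from 15 to 20 changes the probability by well under one
# in a million, so the data carry almost no information about such a shift.
change = inv_logit(20.0) - inv_logit(15.0)
```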

Following Rindskopf and Rindskopf (1986), we will nevertheless interpret 
these parameter estimates. For each set of test results in Table 9.2, the ex¬ 
pected count was obtained by multiplying the likelihood contribution for a 
patient with these test results, 

$$g(y_j) = \pi_1 \prod_i \Pr(y_{ij} \mid c = 1) + \pi_2 \prod_i \Pr(y_{ij} \mid c = 2),$$


by 94, the number of patients. Comparing the expected counts with the ob¬ 
served counts in Table 9.2, the model appears to fit well. 
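
The expected counts can be approximately reproduced from the logit estimates in the Est columns of Table 9.3. The sketch below is an illustration, not the authors' code; because the published estimates are rounded to two decimals, the results agree with Table 9.2 only to within rounding. All names are hypothetical.

```python
import math
from itertools import product

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Logit estimates transcribed from the Est columns of Table 9.3
# (items: Q-wave, History, LDH, CPK).
e_class1 = [-17.58, -1.42, -3.59, -1.41]   # class 1, 'No MI'
e_class2 = [1.19, 1.33, 1.57, 16.86]       # class 2, 'MI'
pi1 = inv_logit(0.17)                      # probability of class 1
pi2 = 1.0 - pi1                            # prevalence of MI
n_patients = 94

def class_likelihood(pattern, logits):
    """Product over items of Pr(y_i | c), assuming conditional independence."""
    value = 1.0
    for y, e in zip(pattern, logits):
        p = inv_logit(e)
        value *= p if y == 1 else 1.0 - p
    return value

def likelihood(pattern):
    """Mixture likelihood contribution g(y) for one response pattern."""
    return (pi1 * class_likelihood(pattern, e_class1)
            + pi2 * class_likelihood(pattern, e_class2))

expected = {pat: n_patients * likelihood(pat) for pat in product((0, 1), repeat=4)}
for pat in sorted(expected, reverse=True):
    print(pat, round(expected[pat], 2))
```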

From Table 9.3, the prevalence of MI is estimated as 0.46. The specificity 
of [Q-wave] is estimated as 1, implying that all patients without MI will have 
a negative result on that test. [History] has the lowest specificity of 0.70. The 
estimated sensitivities range from 0.77 for [Q-wave] to 1.00 for [CPK], so that 
77% of MI cases test positively on [Q-wave] and 100% on [CPK]. 

We can obtain the posterior probabilities (similar to ‘positive predictive
values’) of MI given the four test results using Bayes theorem (see Section 7.2),

$$\omega(e_2 \mid y_j; \hat\vartheta) = \Pr(c = 2 \mid y_j) = \frac{\pi_2 \prod_i \Pr(y_{ij} \mid c = 2)}{\pi_1 \prod_i \Pr(y_{ij} \mid c = 1) + \pi_2 \prod_i \Pr(y_{ij} \mid c = 2)}.$$


These probabilities are presented in the last column of Table 9.2, where bold¬ 
face means that a patient with these test results has a higher posterior proba¬ 
bility of being in class 2 than class 1 and is therefore diagnosed as MI using the 
empirical Bayes modal classification rule (see Section 7.4). For most patients 
the diagnosis (classification) is very clear with posterior probabilities close to 
0 and 1. For each patient we can work out the probability of misclassification 
(using the empirical Bayes modal classification rule) as 


$$f_j = 1 - \max_c\, \omega(e_c \mid y_j; \hat\vartheta),$$


see (7.13) on page 237. For instance, for $y_j = (0, 1, 0, 1)$, the patient would
be classified as ‘no myocardial infarction’ and the probability of
misclassification would be 0.42. This is the only test result with a large probability of
misclassification and is fortunately expected to occur for only 3.26/94 = 3.5% 
of patients. 
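
The posterior probability and misclassification probability for a given set of test results can be sketched as follows, again using the rounded logit estimates from the Est columns of Table 9.3 (so agreement with Table 9.2 is approximate). The function names are hypothetical.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Logit estimates from Table 9.3 (items: Q-wave, History, LDH, CPK).
logits = {1: [-17.58, -1.42, -3.59, -1.41],   # class 1, 'No MI'
          2: [1.19, 1.33, 1.57, 16.86]}       # class 2, 'MI'
prior = {1: inv_logit(0.17)}
prior[2] = 1.0 - prior[1]

def joint(pattern, c):
    """Prior probability of class c times Pr(pattern | c)."""
    value = prior[c]
    for y, e in zip(pattern, logits[c]):
        p = inv_logit(e)
        value *= p if y == 1 else 1.0 - p
    return value

def posterior_mi(pattern):
    """Posterior probability of class 2 (MI) by Bayes theorem."""
    return joint(pattern, 2) / (joint(pattern, 1) + joint(pattern, 2))

def misclassification(pattern):
    """One minus the largest posterior probability (modal classification)."""
    post = posterior_mi(pattern)
    return 1.0 - max(post, 1.0 - post)

for pattern in [(1, 1, 1, 1), (0, 1, 0, 1), (0, 0, 0, 1)]:
    print(pattern, round(posterior_mi(pattern), 3), round(misclassification(pattern), 3))
```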

We can estimate the proportion of classification errors in the population
using the sample average of $f_j$, giving 0.030. If we had no test results, we
would have to diagnose patients according to the prior probabilities. Everyone
would be diagnosed as ‘no myocardial infarction’, since this is more
likely ($\pi_1 = 0.54$) than myocardial infarction ($\pi_2 = 0.46$). The estimated
probability of misclassification would be 0.46. The estimated proportional
reduction of classification error due to knowing the test results is therefore
(0.4579 - 0.0296)/0.4579 = 0.94. If we use the expectation of $f_j$ instead of the
sample average (using model-based expected frequencies instead of the observed
frequencies), the proportional reduction of classification error becomes
0.95. Use of covariate information such as age and sex would be likely to
improve diagnostic accuracy even further; see Section 13.5 for an example of a
latent class model with covariates.
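
The arithmetic behind the two proportional reductions can be retraced from the columns of Table 9.2. The sketch below is illustrative only; it works from the rounded posterior probabilities in the table and takes the baseline error rate 0.4579 from the text.

```python
# (observed count, expected count, posterior probability of MI) per response
# pattern, transcribed from Table 9.2 in row order.
rows = [(24, 21.62, 1.000), (5, 6.63, 0.992), (4, 5.70, 1.000), (3, 1.95, 0.889),
        (3, 4.50, 1.000), (5, 3.26, 0.420), (2, 1.19, 1.000), (7, 8.16, 0.044),
        (0, 0.00, 0.017), (0, 0.22, 0.000), (0, 0.00, 0.001), (1, 0.89, 0.000),
        (0, 0.00, 0.000), (7, 7.78, 0.000), (0, 0.00, 0.000), (33, 32.11, 0.000)]

def error_rate(weights):
    """Weighted average of the misclassification probabilities 1 - max posterior."""
    total = sum(weights)
    errors = sum(w * (1.0 - max(p, 1.0 - p)) for w, (_, _, p) in zip(weights, rows))
    return errors / total

baseline = 0.4579   # error rate when everyone is classified as 'no MI'
observed_rate = error_rate([obs for obs, _, _ in rows])   # sample average
expected_rate = error_rate([exp for _, exp, _ in rows])   # model-based expectation

pre_observed = (baseline - observed_rate) / baseline
pre_expected = (baseline - expected_rate) / baseline
print(round(observed_rate, 4), round(pre_observed, 2), round(pre_expected, 2))
```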


9.4 Arithmetic reasoning: Item response models 

We will analyze data³ from the Profile of American Youth (U.S. Department
of Defense, 1982), a survey of the aptitudes of a national probability sample 
of Americans aged 16 through 23. The data for four items of the arithmetic 
reasoning test of the Armed Services Vocational Aptitude Battery (Form 8A) 
are shown in Table 9.4 for samples of white males and females and black males 
and females. These data were previously analyzed by Mislevy (1985). 


Table 9.4 Arithmetic reasoning data

  Item response       White     White     Black     Black
  1   2   3   4       Males    Females    Males    Females
  0   0   0   0         23        20        27        29
  0   0   0   1          5         8         5         8
  0   0   1   0         12        14        15         7
  0   0   1   1          2         2         3         3
  0   1   0   0         16        20        16        14
  0   1   0   1          3         5         5         5
  0   1   1   0          6        11         4         6
  0   1   1   1          1         7         3         0
  1   0   0   0         22        23        15        14
  1   0   0   1          6         8        10        10
  1   0   1   0          7         9         8        11
  1   0   1   1         19         6         1         2
  1   1   0   0         21        18         7        19
  1   1   0   1         11        15         9         5
  1   1   1   0         23        20        10         8
  1   1   1   1         86        42         2         4

  Total:               263       228       140       145

Source: Mislevy (1985)


We first estimate a one-parameter logistic item response model (see
Section 3.3.4) for item i and subject j,

$$\operatorname{logit}[\Pr(y_{ij} = 1 \mid \eta_j)] = \beta_i + \eta_j.$$

The parameter estimates are given in Table 9.5 where we note that the
estimated item difficulties $-\hat\beta_i$ increase from item 1 to item 4. This model assumes
that the effect of increasing ability is the same for all items (on the logit
scale), an assumption that can be relaxed using the two-parameter logistic
item response model

$$\operatorname{logit}[\Pr(y_{ij} = 1 \mid \eta_j)] = \beta_i + \lambda_i \eta_j,$$

where we set $\lambda_1 = 1$ for identification. The model can be written in GRC
formulation as

$$\operatorname{logit}[\Pr(y_{ij} = 1 \mid \eta_j)] = d_i'\beta + \eta_j\, d_i'\lambda, \qquad (9.1)$$

where $d_i$ is a four-dimensional vector with ith element equal to 1 and all
other elements equal to 0. The parameter estimates are also given in Table 9.5
where we see that the estimated discrimination parameters or factor loadings
$\hat\lambda_i$ for items 2 and 3 are lower than for the other two items. However, the
two-parameter model does not fit much better than the one-parameter model.

Table 9.5 Estimates for one, two and three-parameter item response models using
20-point adaptive quadrature

[Table body not recoverable from the source extraction. Columns: Est (SE) for
the one-parameter, two-parameter and three-parameter models. Rows: intercepts
for [Item1] to [Item4]; factor loadings for [Item1] to [Item4] (all fixed at 1
in the one-parameter model); guessing parameter; variance; log-likelihood.]

³ The data can be downloaded from gllamm.org/books

The estimated item characteristic curves for the one and two-parameter
item response models are given by

$$\Pr(y_{ij} = 1 \mid \eta_j) = \frac{\exp(\nu_{ij})}{1 + \exp(\nu_{ij})},$$

where the linear predictor $\nu_{ij}$ is $\beta_i + \eta_j$ for the one-parameter model and
$\beta_i + \lambda_i \eta_j$ for the two-parameter model. These curves tend to 0 as ability tends
to $-\infty$.

If it is possible to guess the right answer, as in multiple choice questions,
a more realistic model is the three-parameter logistic item response model
which can be written as

$$\Pr(y_{ij} = 1 \mid \eta_j) = c_i + (1 - c_i)\,\frac{\exp(\nu_{ij})}{1 + \exp(\nu_{ij})}.$$

Here, $c_i$ is often called a ‘guessing parameter’ which can be interpreted as the
probability of a correct answer from a subject with ability minus infinity. This
model does not fit into the general model framework described in Chapter 4
since the response model is not a generalized linear model (conditional on the
latent variable). However, if we fix the guessing parameter to some constant,
for example 0.1, the response model can be expressed as a generalized linear
model with a composite link (see Section 2.3.5, equation (2.14)),

$$\Pr(y_{ij} = 1 \mid \eta_j) = 0.1\, g_1^{-1}(1) + 0.9\, g_2^{-1}(\nu_{ij}),$$


where $g_1$ is the identity link and $g_2$ is the logit link. Assuming that the guessing
parameter is the same for all items, we tried different values of c (from 0 
to 0.4 in steps of 0.02) and maximized the likelihood with respect to the 
other parameters, giving the profile log-likelihood. This profile log-likelihood 
is plotted against c in Figure 9.4 and has a maximum at c=0.22. Approximate 
95% confidence limits are those values of c where the profile log-likelihood is 
3.84/2 lower than the maximum as indicated by the horizontal dotted line in 
the figure. The approximate 95% confidence interval for c therefore is from 
0.14 to 0.28. 
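
Two ingredients of this procedure lend themselves to a short sketch: the item characteristic curve with a guessing parameter, and the rule that reads an approximate 95% interval off a profile log-likelihood evaluated on a grid. The profile values below are a made-up quadratic curve used purely to exercise the rule; they are not the curve in Figure 9.4, so the resulting interval differs from the one reported in the text. All names are hypothetical.

```python
import math

def icc(eta, beta, lam=1.0, c=0.0):
    """Probability of a correct response; c = 0 gives the one/two-parameter curves."""
    nu = beta + lam * eta
    return c + (1.0 - c) / (1.0 + math.exp(-nu))

def profile_interval(grid, profile_loglik, drop=3.84 / 2.0):
    """Smallest and largest grid value whose profile log-likelihood lies within
    `drop` of the maximum (3.84 is the 95th percentile of chi-square with 1 df)."""
    best = max(profile_loglik)
    inside = [c for c, ll in zip(grid, profile_loglik) if best - ll <= drop]
    return min(inside), max(inside)

# Grid of guessing parameters from 0 to 0.4 in steps of 0.02, as in the text.
grid = [round(0.02 * k, 2) for k in range(21)]

# Hypothetical profile log-likelihood peaking at c = 0.22 (illustration only).
toy_profile = [-1990.0 - 600.0 * (c - 0.22) ** 2 for c in grid]

print(round(icc(-50.0, beta=0.0, lam=1.0, c=0.22), 4))   # lower asymptote equals c
print(profile_interval(grid, toy_profile))
```

In an actual analysis each entry of the profile would come from refitting the model with the guessing parameter held at the corresponding grid value.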

The parameter estimates for the three-parameter logistic item response 
model with c = 0.22 are given in Table 9.5. Note that the standard errors 
are underestimated because c is treated as known. The model fits substan¬ 
tially better than the two-parameter model. Unfortunately, the parameter 
estimates are not very reliable because the likelihood appears to be somewhat 
flat. In particular, the correlation between the estimates $\hat\beta_4$ and $\hat\lambda_4$ is
estimated as -0.95. Furthermore, different starting values lead to quite different
estimates but very similar log-likelihood values. Empirical identification (see
Section 5.2.5) thus appears to be fragile.



The item characteristic curves for all three models are shown in Figure 9.5. 
Unlike the one-parameter model, the curves for the two-parameter model in¬ 
tersect with items 1 and 4 having higher slopes than items 2 and 3, clearly vio¬ 
lating double monotonicity. It is clear that the curves for the three-parameter 
model approach an asymptote of 0.22 as ability tends to $-\infty$.

Figure 9.4 Profile log-likelihood for guessing parameter in three-parameter logistic
item response model

Returning to a two-parameter item response model, we now consider the 
following covariates: 

• [Female] a dummy variable for subject being female 

• [Black] a dummy variable for subject being black 

We can specify a structural model for ability $\eta_j$, allowing the mean abilities
to differ between groups,

$$\eta_j = \gamma_0 + \gamma_1 F_j + \gamma_2 B_j + \gamma_3 F_j B_j + \zeta_j,$$

where $F_j$ represents [Female] and $B_j$ [Black]. Since we have included a constant
in the structural model, we have to fix one of the constants in the response
model for identification and set $\beta_1 = 0$. This is a MIMIC model of the kind
discussed in Section 3.5 where the covariates affect the response via a latent 
variable only. 

Table 9.6 gives parameter estimates for this model ($\mathcal{M}_2$) and the model
without covariates, $\gamma_1 = \gamma_2 = \gamma_3 = 0$ ($\mathcal{M}_1$), which is equivalent (see Section 5.3)
to the simple two-parameter item response model in Table 9.5. Deviance and
Pearson $X^2$ statistics are also reported in the table, from which we see that
$\mathcal{M}_2$ fits better than $\mathcal{M}_1$. The variance estimate of the disturbance decreases
from 2.47 for $\mathcal{M}_1$ to 1.88 for $\mathcal{M}_2$ because some of the variability in ability is
‘explained’ by the covariates. There is some evidence for a [Female] by [Black]
interaction. While being female is associated with lower ability among white
people, this is not the case among black people where males and females have
similar abilities. Black people have lower mean abilities than both white men
and white women.

Table 9.6 Estimates for MIMIC models

                                        M1               M2               M3
Parameter                           Est    (SE)      Est    (SE)      Est    (SE)
Intercepts
 β1 [Item1]                          0                0                0
 β2 [Item2]                        -0.21  (0.12)    -0.22  (0.12)    -0.13  (0.13)
 β3 [Item3]                        -0.68  (0.14)    -0.73  (0.14)    -0.57  (0.15)
 β4 [Item4]                        -1.22  (0.19)    -1.16  (0.16)    -1.10  (0.18)
 β5 [Item1] x [Black] x [Female]     0                0              -1.07  (0.69)
Factor loadings
 λ1 [Item1]                          1                1                1
 λ2 [Item2]                         0.67  (0.16)     0.69  (0.15)     0.64  (0.17)
 λ3 [Item3]                         0.73  (0.18)     0.80  (0.18)     0.65  (0.14)
 λ4 [Item4]                         0.93  (0.23)     0.88  (0.18)     0.81  (0.17)
Structural model
 γ0 [Cons]                          0.64  (0.12)     1.41  (0.21)     1.46  (0.23)
 γ1 [Female]                         0              -0.61  (0.20)    -0.67  (0.22)
 γ2 [Black]                          0              -1.65  (0.31)    -1.80  (0.34)
 γ3 [Black] x [Female]               0               0.66  (0.32)     2.09  (0.86)
 ψ (disturbance variance)           2.47  (0.84)     1.88  (0.59)     2.27  (0.74)

Log-likelihood                    -2002.76         -1956.25         -1954.89
Deviance                            204.69           111.68           108.96
Pearson X2                          190.15           102.69           100.00

We can also investigate if there are direct effects of the covariates on the 
responses, in addition to the indirect effects via the latent variable. This could 
be interpreted as ‘item bias’ or ‘differential item functioning’ (DIF), i.e., where 
the probability of responding correctly to an item differs for instance between 
men and women with the same ability. Such item bias would be a problem since 
it suggests that candidates cannot be fairly assessed by the test. Bartholomew 
(1987, 1991) found that the black women performed worse on the first item. 
We will investigate whether this is the case after allowing for group
differences in mean ability by adding the term $\beta_5 F_j B_j d_{i1}$ to (9.1). The parameter
estimates are given under $\mathcal{M}_3$ in Table 9.6. Here there is no evidence that
item 1 functions differently for black females. See also Section 10.3.3 for an
investigation of item bias in two-dimensional ordinal item response models for
political efficacy.

Note that none of the models appear to fit well according to absolute fit
criteria. For example, for $\mathcal{M}_2$, the deviance is 111.68 with 53 degrees of freedom,
although the table is perhaps too sparse to rely on the $\chi^2$ distribution.

We can predict people’s abilities on the basis of their responses to the four 
questions using empirical Bayes as described in Section 7.3.1. The empirical 
Bayes predictions for all possible response patterns are given in Table 9.7 for
$\mathcal{M}_1$ and $\mathcal{M}_2$. The abilities can be interpreted as logits of the probability of
a correct response to item 1 (since $\beta_1 = 0$ and $\lambda_1 = 1$), with 0 corresponding
to a probability of 50% and ±1 to probabilities 73% and 27%. For $\mathcal{M}_2$, the
predicted abilities depend on group, with for instance black males given lower
‘scores’ for the same performance than white males since black males have a
lower mean ability than white males. Statistically, these predictions may be 
better than those ignoring the covariates, but the scoring method certainly 
does not appear to be fair! 


Table 9.7 Empirical Bayes predictions of ability

[Table body not recoverable from the source extraction. Rows: the 16 possible
response patterns for items 1 to 4. Columns: predictions under M1 (all groups)
and under M2 for white males, white females, black males and black females.]

9.5 Nicotine gum and smoking cessation: A meta-analysis

9.5.1 Introduction 


Systematic reviews of available evidence regarding the efficacy of medical 
treatments are of obvious importance for informing clinical practice. Such 
reviews have become increasingly popular, forming a vital part of the gospel 
of ‘evidence based medicine’. Sackett et al. (1991) give the following advice to


“If a rigorous scientific overview has been conducted on the clinical question 
you are attempting to answer, your time is better spent studying it rather than 
a grab (and perhaps distorted) sample of its citations.” 

The importance of systematic reviews is reflected in the formation in 1993 of 
the c Cochrane Collaboration’，which produces and updates vast numbers of 
reviews of clinical trials for most areas of medical research as well as setting 
up guidelines, offering courses, etc. 

Meta-analysis is the statistical approach to combining evidence from different
studies to obtain an overall estimate of treatment effect. Although modern
meta-analysis originates in education and psychology (e.g. Glass, 1976; Hunt,
1997), its recent proliferation in medical research has led to an upsurge of
interest within biostatistics.

Here we discuss meta-analysis of clinical trials of nicotine replacement therapy
for smoking cessation using data⁴ from Silagy et al. (2003). Following
Silagy et al., we carry out a separate analysis of studies using nicotine gum
(rather than for instance nicotine patches) combined with a high level of
support, including formal therapy or ‘assessment and reinforcement’ visits.

In each trial, patients were randomized to a treatment group given nicotine 
gum or a control group. In most studies the control group received placebo 
gum which had the same appearance as the nicotine gum but lacked the active 
ingredient nicotine. In some studies, the control group had no gum. Smoking 
cessation at least 6 months after treatment was the outcome considered. The 
most rigorous definition of abstinence for each trial was used. The results 
for the trials can be summarized by two-by-two tables (treatment arm by 
outcome), which can be derived from the rows of Table 9.8. 

We will consider estimation of the overall odds ratio, the odds of quitting
smoking if treated divided by the odds of quitting if not treated. For an
individual study $j$ this odds ratio can be estimated as

\[
O_j = \frac{d_{1j}/(n_{1j} - d_{1j})}{d_{0j}/(n_{0j} - d_{0j})}, \qquad (9.2)
\]

where $d_{1j}$ and $d_{0j}$ are the numbers of quitters in the treatment and control
groups, respectively, whereas $n_{1j}$ and $n_{0j}$ are the total numbers of subjects
in these groups. Other measures of treatment effect include the risk ratio and
the risk difference.
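As a small illustration of (9.2), the sketch below computes the estimated odds ratio for a single study from its two-by-two counts. The function name `odds_ratio` is ours, chosen for illustration only; the counts are those of the ‘Blondal 1989’ row of Table 9.8.

```python
def odds_ratio(d1, n1, d0, n0):
    """Estimated odds ratio (9.2): odds of quitting if treated
    divided by the odds of quitting if not treated."""
    odds_treated = d1 / (n1 - d1)
    odds_control = d0 / (n0 - d0)
    return odds_treated / odds_control


# 'Blondal 1989' in Table 9.8: 37 of 92 treated and 24 of 90 controls quit.
blondal = odds_ratio(37, 92, 24, 90)
print(f"Blondal 1989 odds ratio: {blondal:.2f}")
```

For this study the odds of quitting are 37/55 in the gum group and 24/66 in the control group, so the ratio is 1.85.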


4 The data can be downloaded from gllamm.org/books

Table 9.8 Randomized studies of nicotine gum and smoking cessation

                       Treated              Control
                   Quitters   Total     Quitters   Total
Study                 d1        n1         d0        n0

Blondal 1989          37        92         24        90
Campbell 1991         21       107         21       105
Fagerstrom 1982       30        50         23        50
Fee 1982              23       180         15       172
Garcia 1989           21        68          5        38
Garvey 2000           75       405         17       203
Gross 1995            37       131          6        46
Hall 1985             18        41         10        36
Hall 1987             30        71         14        68
Hall 1996             24        98         28       103
Hjalmarson 1984       31       106         16       100
Huber 1988            31        54         11        60
Jarvis 1982           22        58          9        58
Jensen 1991           90       211         28        82
Killen 1984           16        44          6        20
Killen 1990          129       600        112       617
Malcolm 1980           6        73          3       121
McGovern 1992         51       146         40       127
Nakamura 1990         13        30          5        30
Niaura 1994            5        84          4        89
Niaura 1999            1        31          2        31
Pirie 1992            75       206         50       211
Puska 1979            29       116         21       113
Schneider 1985         9        30          6        30
Tonnesen 1988         23        60         12        53
Villa 1999            11        21         10        26
Zelman 1992           23        58         18        58

Source: Silagy et al. (2003)


9.5.2 Approaches to meta-analysis 

There are essentially two different approaches to meta-analysis: fixed effects
and random effects. Fixed effects meta-analysis assumes that there is a single
true treatment effect and that any variability between the studies’ estimated
treatment effects is completely due to within-study sampling variability. Note
that this use of the term ‘fixed effects’ is somewhat misleading since the term
usually implies that there is a fixed effect for each cluster (here study); see
Section 3.6.1. The assumption of a common treatment effect is often tested
using Cochran’s Q-test of homogeneity (e.g. Cochran, 1950; DerSimonian and
Laird, 1986).

In contrast, random effects meta-analysis assumes that the true treatment 
effect varies between studies. This variation could be due to differences in 
populations and trial protocols including drug dosage, duration of treatment, 
definition and measurement of outcomes and length of follow-up. The aim of 
the meta-analysis then becomes to estimate the mean treatment effect for an 
imagined population of studies. 

Fleiss (1993) and Bailey (1987) discuss two considerations for choosing between
the two competing approaches. First, the random effects approach attempts
to generalize conclusions to the population of studies including future
studies, whereas the fixed effects approach restricts conclusions to the studies
contributing to the analysis. Second, the random effects approach explicitly
allows for study-to-study variation, in contrast to the fixed effects approach.
In many cases, the studies differ from one another so fundamentally that it
might be easier to argue that it is nonsensical to pool the effect sizes at all
than that there is a single true treatment effect.

Here we adopt the random effects approach because we consider it unlikely a
priori that the treatment effects do not differ between the studies. For instance,
the ‘Blondal 1989’ study used gum containing 4 mg of nicotine for one month
whereas ‘Garcia 1989’ used gum containing 2 mg of nicotine for three to four
months. The former study considered 12 months sustained abstinence whereas
the latter considered 6 months sustained abstinence. The studies also differed
in the nature and intensity of additional support and in the types of smokers
considered. For instance, ‘Campbell 1991’ treated only patients with
smoking-related illnesses whereas most other studies treated any smokers
interested in quitting. Furthermore, studies were conducted in a wide range of
countries including Iceland, Sweden, Spain and the USA.

The predominant approach to meta-analysis, whether fixed effects or random
effects, is to analyze the estimated study-level treatment effects instead of
the original patient-level data. When the effect size of interest is an odds ratio
as here, log odds ratios are often analyzed instead of odds ratios since their
sampling distribution is likely to be better approximated by a normal
distribution. Random effects meta-analysis then consists of estimating the
following linear random intercept model,

\[
\ln(O_j) = \beta_0 + \zeta_{0j} + e_j, \qquad e_j \sim N(0, \theta_j), \qquad (9.3)
\]

where $O_j$ is the estimated odds ratio for study $j$ as defined in (9.2), $\beta_0$ is the
mean log odds ratio of interest and $\zeta_{0j}$, the random intercept, represents the
deviation of the study’s true log odds ratio from the mean log odds ratio. The
within-study standard deviations $\sqrt{\theta_j}$ are simply set equal to the standard
errors of the log odds ratios estimated using Woolf’s method (Woolf, 1955),

\[
\sqrt{\theta_j} = \sqrt{\frac{1}{d_{1j}} + \frac{1}{n_{1j} - d_{1j}} + \frac{1}{d_{0j}} + \frac{1}{n_{0j} - d_{0j}}}. \qquad (9.4)
\]

Recently, ‘meta-regression’, where the heterogeneity between studies is
‘explained’ by including study-specific covariates such as drug dosage in (9.3),
has attracted considerable attention (e.g. Berkey et al., 1995; van Houwelingen
et al., 2002).
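To make the study-level approach in (9.3) and (9.4) concrete, the following sketch pools the log odds ratios of Table 9.8. Note the assumption: it uses the moment estimator of DerSimonian and Laird (1986) for the between-study variance, which is a common choice for model (9.3) but is not the maximum likelihood method used for the patient-level analysis later in this section. All function and variable names are ours.

```python
import math

# (d1, n1, d0, n0) for the 27 studies in Table 9.8
STUDIES = [
    (37, 92, 24, 90), (21, 107, 21, 105), (30, 50, 23, 50), (23, 180, 15, 172),
    (21, 68, 5, 38), (75, 405, 17, 203), (37, 131, 6, 46), (18, 41, 10, 36),
    (30, 71, 14, 68), (24, 98, 28, 103), (31, 106, 16, 100), (31, 54, 11, 60),
    (22, 58, 9, 58), (90, 211, 28, 82), (16, 44, 6, 20), (129, 600, 112, 617),
    (6, 73, 3, 121), (51, 146, 40, 127), (13, 30, 5, 30), (5, 84, 4, 89),
    (1, 31, 2, 31), (75, 206, 50, 211), (29, 116, 21, 113), (9, 30, 6, 30),
    (23, 60, 12, 53), (11, 21, 10, 26), (23, 58, 18, 58),
]


def log_or_and_var(d1, n1, d0, n0):
    """Log odds ratio from (9.2) and its Woolf variance theta_j from (9.4)."""
    log_or = math.log((d1 / (n1 - d1)) / (d0 / (n0 - d0)))
    var = 1 / d1 + 1 / (n1 - d1) + 1 / d0 + 1 / (n0 - d0)
    return log_or, var


def dersimonian_laird(studies):
    """Moment-based random effects pooling of study-level log odds ratios."""
    ys, vs = zip(*(log_or_and_var(*s) for s in studies))
    w = [1 / v for v in vs]                       # fixed effects weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(ys) - 1)) / c)      # between-study variance
    w_star = [1 / (v + tau2) for v in vs]         # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {"fixed": fixed, "Q": q, "tau2": tau2, "pooled": pooled, "se": se}


result = dersimonian_laird(STUDIES)
print({k: round(v, 3) for k, v in result.items()})
```

Here `tau2` plays the role of the variance of $\zeta_{0j}$ in (9.3), and `pooled` estimates $\beta_0$. The sketch breaks down if any cell count is zero, which is one of the weaknesses of study-level analysis.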

Unfortunately, analyzing study-level estimates of effect size is problematic
because the normality assumption will often be violated. In the case of log odds
ratios, this will be the case when the studies are small and/or the outcome of
interest is rare. If $d_0$ and/or $d_1$ are zero this approach requires ad hoc practices
such as adding 0.5 to the counts (a practice also recommended by Gart and
Zweifel (1967) for small counts to reduce bias). It is therefore preferable to
model the observed patient-level dichotomous responses directly. Surprisingly,
analysis of study-level estimates is common not just in applied papers, but also
in methodological work (e.g. Normand, 1999), including Bayesian treatments
(see e.g. Carlin, 1992; DuMouchel et al., 1996; Gelman et al., 2003) where
‘proper’ modeling using Markov chain Monte Carlo methods is straightforward.
Analyzing study-level estimates of effect sizes is perhaps justified only
if patient-level data is not available (see also Chalmers, 1993).
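The zero-count problem can be seen in a few lines of code. The sketch below is illustrative only: the function name is ours, and the rule of adding 0.5 to all four cells whenever any cell is zero is one common variant of the ad hoc practice mentioned above, assumed here for concreteness.

```python
import math


def woolf_log_or(d1, n1, d0, n0, correction=0.5):
    """Log odds ratio and Woolf standard error for one study.

    If any cell of the two-by-two table is zero, `correction` is added
    to all four cells (an assumed convention); otherwise the raw counts
    are used unchanged.
    """
    cells = [d1, n1 - d1, d0, n0 - d0]
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    log_or = math.log((a / b) / (c / d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


# Hypothetical small trial with no quitters in the treated arm.
print(woolf_log_or(0, 20, 3, 20))
```

Without the correction, the log odds ratio for the hypothetical trial would be minus infinity and the standard error undefined.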


9.5.3 Random effects modeling of patient-level data 


For patient-level data, where we let $i$ index patients and $j$ index studies,
Agresti and Hartzel (2000) consider the random coefficient model

\[
\mathrm{logit}[\Pr(y_{ij} = 1 \mid \zeta_{0j}, \zeta_{1j})] = \beta_0 + \beta_1 x_{ij} + \zeta_{0j} + \zeta_{1j} x_{ij},
\]

where

\[
x_{ij} = \begin{cases} 0.5 & \text{for treated patients} \\ -0.5 & \text{for control patients} \end{cases}
\]

and

\[
(\zeta_{0j}, \zeta_{1j})' \sim N_2(\mathbf{0}, \boldsymbol{\Psi}).
\]

Here $\beta_0$ and $\zeta_{0j}$ are fixed and random intercepts, respectively, and $\beta_1$ and $\zeta_{1j}$
fixed and random slopes of $x_{ij}$. $\beta_1$ represents the log odds ratio of interest
whereas $\beta_1 + \zeta_{1j}$ represents the ‘true’ log odds ratio of study $j$. The
study-specific intercepts are sometimes treated as fixed; see for instance Turner et
al. (2000) and Thompson et al. (2001).

Agresti and Hartzel assume that the random intercept and slope are
uncorrelated with $\psi_{10} = 0$. Note that the coding of $x_{ij}$ becomes important in
this case; a model with for instance $x_{ij} = 0, 1$ would not be equivalent to a
model with $x_{ij} = -0.5, 0.5$ (see page 54). Agresti and Hartzel argue that an
advantage of the latter coding or ‘centering’ is that with $\psi_{10} = 0$ the total
variance $\mathrm{Var}(\zeta_{0j} + \zeta_{1j} x_{ij})$ of the log odds is the same for both groups.
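The effect of the coding can be checked numerically. The sketch below evaluates $\mathrm{Var}(\zeta_{0j} + \zeta_{1j} x_{ij}) = \psi_{00} + 2 x_{ij} \psi_{10} + x_{ij}^2 \psi_{11}$ under both codings with $\psi_{10} = 0$. The variance values are chosen purely for illustration (they are the squares of 0.70 and 0.22), and the function name is ours.

```python
def total_variance(x, psi00, psi11, psi10=0.0):
    """Var(zeta0 + zeta1 * x) for random effects with variances psi00, psi11
    and covariance psi10."""
    return psi00 + 2.0 * x * psi10 + x ** 2 * psi11


psi00, psi11 = 0.70 ** 2, 0.22 ** 2  # illustrative values only

centered = [total_variance(x, psi00, psi11) for x in (-0.5, 0.5)]
dummy = [total_variance(x, psi00, psi11) for x in (0.0, 1.0)]

print("centered coding (control, treated):", centered)
print("0/1 coding      (control, treated):", dummy)
```

With the centered coding both groups have variance $\psi_{00} + \psi_{11}/4$, whereas with the 0/1 coding the treated group has variance $\psi_{00} + \psi_{11}$ and the control group only $\psi_{00}$.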

We will nevertheless investigate the validity of the assumption of zero
correlation for the nicotine gum data. The models were estimated by maximum
likelihood using 20-point adaptive quadrature. The correlation between random
intercept and slope was estimated as 0.17 and the difference in log-likelihoods
between the model allowing for a correlation and the model with zero
correlation was only 0.05. Thus, we present estimates for the model with $\psi_{10} = 0$
under ‘Empirical Bayes’ (maximum likelihood) in Table 9.9. There appears to
be clear evidence that nicotine gum increases the odds of quitting, with an
estimated odds ratio of exp(0.57) = 1.77. There is some heterogeneity in the
overall prevalence of quitting, reflected in the estimate $\sqrt{\hat\psi_{00}} = 0.70$, and small
variability in the treatment effects, estimated as $\sqrt{\hat\psi_{11}} = 0.22$.

Table 9.9 Empirical Bayes, full Bayes and NPML estimates

                             Empirical Bayes    Full Bayes         NPMLE
Parameter                    Est     (SE)       Est     (SE)       Est     (SE)

Fixed part
$\beta_0$ [Cons]             -1.16   (0.14)     -1.17   (0.15)     -1.20   (0.15)
$\beta_1$ [Treat]             0.57   (0.09)      0.59   (0.09)      0.59   (0.10)

Random part
$\sqrt{\psi_{00}}$ [Cons]     0.70   (0.11)      0.73   (0.12)
$\sqrt{\psi_{11}}$ [Treat]    0.22   (0.10)      0.20   (0.10)

We also considered a fully Bayesian approach using noninformative priors as
described in Section 6.11. The prior distributions of $\beta_0$ and $\beta_1$ were specified as
$N(0, 10^6)$ and the priors of $\zeta_{0j}$ and $\zeta_{1j}$ as $N(0, \psi_{00})$ and $N(0, \psi_{11})$, respectively.
The hyperpriors of $\psi_{00}$ and $\psi_{11}$ were specified as IG(0.001, 0.001), where IG
is the inverse gamma density given on page 210. Note that this specification
is similar to that used for a different meta-analysis in Chapter 10 of Volume
1 of the BUGS Examples Manual (Spiegelhalter et al., 1996b). The difference
is that we treat the study-specific intercept as a random effect with a prior
and hyperprior for the variance whereas the manual treats it as a ‘fixed effect’
with no hyperprior. Gibbs sampling, as implemented in BUGS, was used to
estimate the parameters (see Section 6.11.5). A burn-in of 10000 iterations was
used and the means and standard deviations of the parameters were obtained
from a further 10000 iterations. The results are given in Table 9.9 under ‘full
Bayes’ and agree quite closely with the empirical Bayes (maximum likelihood)
results.

We could use (9.2) and (9.4) to estimate the individual log odds ratios and
standard errors for each study. However, if we believe in the Bayesian random
effects model, all inferences regarding the individual log odds ratios $\beta_1 + \zeta_{1j}$
should be based on their marginal posterior distribution, integrating over all
other model parameters, here $\beta_0$, $\psi_{00}$ and $\psi_{11}$. Within an MCMC algorithm
(see Section 6.11.5), this amounts to using the empirical distributions of the
sampled log odds ratios $\beta_1 + \zeta_{1j}$. In empirical Bayes, the conditional posterior
distribution of the $\zeta_{1j}$ is used, given that the other parameters are equal
to the maximum likelihood estimates (see Section 7.3.1). The predicted log
odds ratios then become $\hat\beta_1 + \tilde\zeta_{1j}$, where $\tilde\zeta_{1j}$ is the empirical Bayes prediction
(mean of conditional posterior).

Bayesians use credible intervals instead of confidence intervals. For a 95%
credible interval, the posterior probability that the parameter lies in the
interval is 95%. Figure 9.6 shows approximate Bayes 95% credible intervals (using
estimated percentiles of the posterior distribution based on 10000 draws) of
the true effect $\beta_1 + \zeta_{1j}$ of each study, as well as the empirical Bayes
counterparts. Unlike the fully Bayesian intervals, the empirical Bayes intervals are
derived by treating the model parameters as known and assuming that the
posterior distribution of the random slopes is normal. Here the intervals are
constructed as posterior mean ± 1.96 times the posterior standard deviation.
We would expect the fully Bayes intervals to be wider since they attempt to
account for parameter uncertainty. However, the differences are generally very
small with the possible exception of the ‘Huber 1988’ study. The raw log odds
ratios $\ln(O_j)$, shown as ‘×’s, tend to be further from the average log odds ratio,
shown as a solid vertical line, than the empirical and full Bayes predictions.
This is due to shrinkage as discussed in Section 7.3.1.

Instead of assuming bivariate normality for the random intercept and slope,
Aitkin (1999b) leaves their joint distribution unspecified by using
nonparametric maximum likelihood estimation (NPMLE) (see Section 4.4.2). Here a
discrete distribution is used with locations $\zeta_{0j} = e_{0c}$, $\zeta_{1j} = e_{1c}$ and masses or
probabilities $\pi_c$, $c = 1, \ldots, C$, giving a mixture regression model. The number
of masses $C$ is determined to maximize the likelihood. Note that the intercepts
and slopes are now no longer independent. Using the Gateaux derivative
method described in Section 6.5, where the number of masses is increased one
at a time until the derivative is negative, we found that $C = 10$.

The estimates of the mean intercept and slope, given in Table 9.9 under
NPMLE, are remarkably similar to the estimates assuming bivariate normality
and $\psi_{10} = 0$. The log-likelihood for NPMLE was -3061.5 compared with
-3074.2 assuming bivariate normality and $\psi_{10} = 0$. In NPMLE, the variances
and covariances for the random effects are not model parameters but can be
derived from the discrete distribution. The standard deviation of the random
intercepts was 0.76, the standard deviation of the random slopes 0.30 and the
correlation 0.08.
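The derivation of moments from a discrete distribution is straightforward: means, variances and the covariance are probability-weighted sums over the mass points. The sketch below shows the calculation. The three locations and masses in the example are hypothetical, since the estimated locations are only shown graphically, and the function name is ours.

```python
import math


def discrete_moments(e0, e1, pi):
    """Means, standard deviations and correlation of (intercept, slope)
    for a discrete distribution with locations (e0[c], e1[c]) and masses pi[c]."""
    m0 = sum(p * a for p, a in zip(pi, e0))
    m1 = sum(p * b for p, b in zip(pi, e1))
    v0 = sum(p * (a - m0) ** 2 for p, a in zip(pi, e0))
    v1 = sum(p * (b - m1) ** 2 for p, b in zip(pi, e1))
    c01 = sum(p * (a - m0) * (b - m1) for p, a, b in zip(pi, e0, e1))
    return {
        "mean0": m0, "mean1": m1,
        "sd0": math.sqrt(v0), "sd1": math.sqrt(v1),
        "corr": c01 / math.sqrt(v0 * v1),
    }


# Hypothetical three-mass solution (not the estimates for the gum data).
print(discrete_moments(e0=[-2.0, -1.2, -0.3],
                       e1=[0.4, 0.7, 0.5],
                       pi=[0.3, 0.5, 0.2]))
```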

Figure 9.7 shows the NPMLE masses. In the top panel, the locations of the
circles are $(e_{0c}, e_{1c})$ whereas the areas are proportional to the probabilities $\pi_c$. In
the bottom panel, the probabilities are instead shown as the heights of spikes.
Figure 9.8 shows the log odds for the control and treatment (gum) groups for
each of the locations. The thickness of the lines reflects the probabilities which
are also shown in the figure as percentages.

In summary, all three approaches considered produce practically identical
estimates. The overall conclusion is that nicotine gum increases the odds of
smoking cessation by about 80%. The 95% confidence and credible intervals
for the odds ratio (derived from the estimated log odds ratio and its standard
error) are nearly the same for all three methods; the interval for empirical
Bayes was from 1.5 to 2.1. There does not appear to be much heterogeneity
in the treatment effect between studies.
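The interval quoted for empirical Bayes can be reproduced from the estimate and standard error of $\beta_1$ in Table 9.9 by exponentiating the limits of a normal-theory interval on the log scale. The following is a minimal sketch; the function name is ours.

```python
import math


def odds_ratio_interval(log_or, se, z=1.96):
    """Odds ratio and confidence limits from a log odds ratio and its
    standard error, using estimate +/- z * SE on the log scale."""
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


# Empirical Bayes (maximum likelihood) estimate of beta_1 from Table 9.9
est, lower, upper = odds_ratio_interval(0.57, 0.09)
print(f"odds ratio {est:.2f}, 95% interval {lower:.1f} to {upper:.1f}")
```

This gives an odds ratio of 1.77 with limits of about 1.5 and 2.1, in agreement with the text.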

It is important to note that meta-analysis is not without its critics; see for
instance Thompson and Pocock (1991) and Oakes (1993). One problem that is
generally acknowledged is publication bias. This is due to small studies being
difficult to publish if the findings are not significant, leading to overestimated
treatment effects in the meta-analysis (e.g. Sterling, 1959; Sutton et al., 2000).

Figure 9.8 Predicted log odds by treatment group for each ‘class’


9.6 Wives’ employment transitions: Markov models with 
unobserved heterogeneity 

We will now analyze data⁵,⁶ on wives’ employment from a panel survey or
repeated measurement study. The ‘Social Change and Economic Life Initiative’,
described in Davies et al. (1992) and Davies (1993), followed the employment
status of wives on a monthly basis from their month of marriage to the survey
month in 1987. Here, we consider a subsample from Rochdale, one of the six
localities studied.

The response is whether a wife is in paid employment (state 1) or not (state 
0). The explanatory variables are all time-varying: 

• [HUnemp] a dummy variable taking the value 1 if the wife’s husband is 
unemployed and 0 otherwise 

• [Time] the month of observation since the beginning of the study 

• [Child1] a dummy for the wife having children under the age of 1

• [Child5] a dummy for the wife having children under the age of 5 

• [Age] the wife’s age in years 

• [Agesq] the wife’s age in years squared 

There are two competing explanations of the empirical regularity that people
having experienced an event in the past are more likely to experience
the event in the future than others. One explanation is ‘causal’: employment
changes people, so that having been employed in itself changes a wife’s future
probability of employment, inducing dependence over time. This is called true
state dependence by Heckman (1978, 1981a,c). This causal explanation is very
common but is often naive because an alternative explanation may be just as
plausible. Here, the observed dependence over time is interpreted as having been
induced by permanent components that represent unobserved heterogeneity.
Thus, the higher probability of being employed could be due to unobserved
characteristics of the wives, some having a high probability of future
employment regardless of their previous employment history. This is called spurious
state dependence by Heckman, who argues that distinguishing between true
and spurious state dependence is crucial in observational studies.

5 We thank Dave Stott for providing us with these data.

6 The data can be downloaded from gllamm.org/books

Consider first a conventional first-order discrete-time Markov model, where
employment status $y_{tj}$ at time $t$ is conditionally independent of employment
history given the previous employment status $y_{t-1,j}$ and current covariates.
Using a logit link, the model can be written as

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j}) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma y_{t-1,j})}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma y_{t-1,j})}.
\]

Unobserved heterogeneity can be included in the Markov model using a
random intercept $\zeta_j$ (e.g. Heckman, 1981a),

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j}, \zeta_j) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma y_{t-1,j} + \zeta_j)}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma y_{t-1,j} + \zeta_j)}.
\]

According to this model there is true state dependence if $\gamma \neq 0$ and spurious
state dependence if $\gamma = 0$.

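We can also allow the random intercept variance to depend on the previous
state by specifying a factor model,

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j} = 0, \zeta_j) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \zeta_j)}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \zeta_j)}
\]

and

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j} = 1, \zeta_j) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma + \lambda\zeta_j)}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta} + \gamma + \lambda\zeta_j)}.
\]

The first probability is conditional on the subject previously being in state 0,
whereas the second is conditional on being in state 1. The other two transition
probabilities are $\Pr(y_{tj} = 0 \mid y_{t-1,j} = 0) = 1 - \Pr(y_{tj} = 1 \mid y_{t-1,j} = 0)$ and
$\Pr(y_{tj} = 0 \mid y_{t-1,j} = 1) = 1 - \Pr(y_{tj} = 1 \mid y_{t-1,j} = 1)$. The factor $\zeta_j$ represents
subject-specific unobserved heterogeneity for the transition process, with variance
$\lambda^2\psi$ when $y_{t-1,j} = 1$ and variance $\psi$ when $y_{t-1,j} = 0$.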
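The three transition models differ only in how the previous state and the random effect enter the linear predictor, which the following sketch makes explicit. The function and argument names are ours, and the numerical values in the example are hypothetical rather than estimates for the Rochdale data.

```python
import math


def inv_logit(v):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-v))


def prob_employed(xb, y_prev, gamma, zeta=0.0, lam=1.0):
    """Pr(y_tj = 1 | x_tj, y_{t-1,j}, zeta_j) with xb = x_tj' beta.

    zeta = 0 gives the conventional Markov model, lam = 1 the random
    intercept model, and a free lam the factor model in which the
    random effect has loading lam when the previous state is 1.
    """
    if y_prev == 0:
        return inv_logit(xb + zeta)
    return inv_logit(xb + gamma + lam * zeta)


# Hypothetical values: x'beta = -1.0, gamma = 4.0, a wife with zeta = 1.5
for y_prev in (0, 1):
    p = prob_employed(-1.0, y_prev, gamma=4.0, zeta=1.5, lam=-0.1)
    print(f"previous state {y_prev}: Pr(employed) = {p:.3f}")
```

The probabilities of being out of employment follow as one minus the returned values.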

The parameter estimates for these three models, denoted M1, M2 and
M3, respectively, are given in Table 9.10. The largest effect estimates are for
[Child1] and [HUnemp], both variables decreasing the odds of wives’ employment.
A negative effect of having children under the age of one on employment
is hardly surprising. The negative effect of [HUnemp] might for instance be due
to a considerable reduction in the husband’s unemployment benefits if his wife
is employed or damaged male self-esteem. Comparing models 1 and 3, there
is clearly evidence for both state dependence and unobserved heterogeneity,
the random effect variance being considerably larger when the previous state
is unemployment.

Table 9.10 Estimates for simple Markov models with and without unobserved
heterogeneity

                        M1                 M2                 M3
Parameter               Est     (SE)       Est     (SE)       Est     (SE)

Fixed part
$\beta_0$ [Cons]        -1.277  (0.240)    -1.114  (0.299)    -1.222  (0.339)
$\beta_1$ [HUnemp]      -1.406  (0.373)    -1.512  (0.412)    -1.524  (0.388)
$\beta_2$ [Time]        -0.012  (0.025)    -0.011  (0.026)     0.006  (0.032)
$\beta_3$ [Child1]      -3.008  (0.392)    -2.953  (0.408)    -2.767  (0.409)
$\beta_4$ [Child5]      -0.165  (0.253)    -0.241  (0.272)    -0.383  (0.294)
$\beta_5$ [Age]         -0.005  (0.014)     0.000  (0.016)    -0.009  (0.018)
$\beta_6$ [Agesq]       -0.001  (0.001)    -0.001  (0.001)    -0.002  (0.001)
$\gamma$ [Lag]           4.391  (0.209)     4.226  (0.264)     4.380  (0.326)

Random part
$\psi$                   -                  0.308  (0.326)     2.177  (0.932)
$\lambda$                -                  -                 -0.119  (0.164)

Log-likelihood          -411.50            -410.89            -401.25

Model M4, suggested by Francis et al. (1996), allows both the regression
parameters and the effect of unobserved heterogeneity to depend on previous
state,

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j} = 0, \zeta_j) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta}^0 + \zeta_j)}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta}^0 + \zeta_j)}
\]

and

\[
\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j} = 1, \zeta_j) =
\frac{\exp(\mathbf{x}_{tj}'\boldsymbol{\beta}^1 + \lambda\zeta_j)}{1 + \exp(\mathbf{x}_{tj}'\boldsymbol{\beta}^1 + \lambda\zeta_j)}.
\]

The fixed effects of the covariates $\mathbf{x}_{tj}$ are $\boldsymbol{\beta}^0$ if the previous state is 0 and
$\boldsymbol{\beta}^1$ if the previous state is 1. There is true state dependence if $\boldsymbol{\beta}^0 \neq \boldsymbol{\beta}^1$ after
accounting for unobserved heterogeneity. In this model, spurious state
dependence arises if $\boldsymbol{\beta}^0 \neq \boldsymbol{\beta}^1$ before introducing unobserved heterogeneity but
$\boldsymbol{\beta}^0 = \boldsymbol{\beta}^1$ after taking the heterogeneity into account. The model can be written
in the GRC formulation described in Section 4.2.3 as

\[
\mathrm{logit}[\Pr(y_{tj} = 1 \mid \mathbf{x}_{tj}, y_{t-1,j}, \zeta_j)] = \nu_{tj},
\]

where the linear predictor is

\[
\nu_{tj} = (1 - y_{t-1,j})\,\mathbf{x}_{tj}'\boldsymbol{\beta}^0 + y_{t-1,j}\,\mathbf{x}_{tj}'\boldsymbol{\beta}^1
+ \zeta_j[(1 - y_{t-1,j}) + \lambda y_{t-1,j}].
\]

The parameter estimates for this general Markov model are given in
Table 9.11. The greatest difference in coefficients is for [Child1]. For employed
wives, [Child1] substantially increases the odds of leaving employment (odds
ratio exp(2.776) = 16.1), whereas for unemployed wives, this variable
increases the odds of staying out of employment to a lesser extent (odds ratio
exp(1.439) = 4.2).

Table 9.11 Estimates for general Markov model M4

[Fixed part not recoverable from the source. Layout: rows $\beta_0$ [Cons], $\beta_1$ [HUnemp], $\beta_2$ [Time], $\beta_3$ [Child1], $\beta_4$ [Child5], $\beta_5$ [Age], $\beta_6$ [Agesq]; columns Est (SE) for ‘Getting job’ ($y_{t-1,j} = 0$, $d = 0$) and ‘Keeping job’ ($y_{t-1,j} = 1$, $d = 1$).]

Random part
$\lambda$               -0.212
$\psi$                   1.411  (0.717)

Log-likelihood          -396.99

In order to interpret the dependence structure, we will formulate the model
as a latent response model,

\[
y^*_{tj} = \mathbf{x}_{tj}'\boldsymbol{\beta}^0 + \zeta_j + \epsilon_{tj} \quad \text{if } y_{t-1,j} = 0
\]

and

\[
y^*_{tj} = \mathbf{x}_{tj}'\boldsymbol{\beta}^1 + \lambda\zeta_j + \epsilon_{tj} \quad \text{if } y_{t-1,j} = 1,
\]

where $\epsilon_{tj}$ has a logistic distribution with variance $\pi^2/3$.

It follows that the latent response correlations between two nonadjacent
occasions $s$ and $u$, $|s - u| > 1$, are

\[
\rho_{00} = \mathrm{Cor}(y^*_{sj}, y^*_{uj} \mid y_{s-1,j} = 0, y_{u-1,j} = 0, \mathbf{x}_{sj}, \mathbf{x}_{uj})
= \frac{\psi}{\psi + \pi^2/3},
\]

\[
\rho_{11} = \mathrm{Cor}(y^*_{sj}, y^*_{uj} \mid y_{s-1,j} = 1, y_{u-1,j} = 1, \mathbf{x}_{sj}, \mathbf{x}_{uj})
= \frac{\lambda^2\psi}{\lambda^2\psi + \pi^2/3},
\]

and

\[
\rho_{01} = \mathrm{Cor}(y^*_{sj}, y^*_{uj} \mid y_{s-1,j} = 0, y_{u-1,j} = 1, \mathbf{x}_{sj}, \mathbf{x}_{uj})
= \frac{\lambda\psi}{\sqrt{(\psi + \pi^2/3)(\lambda^2\psi + \pi^2/3)}}.
\]

Here, $\rho_{00}$ represents the within-wife residual correlation in the propensity to
become employed when unemployed, estimated as $\hat\rho_{00} = 0.30$. $\rho_{11}$ is the
within-wife correlation in the propensity to remain employed, estimated as $\hat\rho_{11} = 0.02$.
$\rho_{01}$ is the correlation between the propensity to become employed when not
employed and the propensity to remain employed when employed, estimated
as $\hat\rho_{01} = -0.08$.
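The three correlation estimates can be checked against the random part of Table 9.11, $\hat\lambda = -0.212$ and $\hat\psi = 1.411$, using the logistic residual variance $\pi^2/3$. The sketch below assumes the usual factor-model form of these correlations (common factor variance over total variance); the function name is ours.

```python
import math

LOGISTIC_VAR = math.pi ** 2 / 3  # variance of the logistic residual


def latent_correlations(psi, lam):
    """Latent response correlations (rho00, rho11, rho01) for the general
    Markov model with factor variance psi and loading lam."""
    var0 = psi + LOGISTIC_VAR             # total variance, previous state 0
    var1 = lam ** 2 * psi + LOGISTIC_VAR  # total variance, previous state 1
    rho00 = psi / var0
    rho11 = lam ** 2 * psi / var1
    rho01 = lam * psi / math.sqrt(var0 * var1)
    return rho00, rho11, rho01


rho00, rho11, rho01 = latent_correlations(psi=1.411, lam=-0.212)
print(f"rho00 = {rho00:.2f}, rho11 = {rho11:.2f}, rho01 = {rho01:.2f}")
```

Rounded to two decimals this gives 0.30, 0.02 and -0.08, matching the values reported above.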

Initial conditions must be addressed in dynamic models. Conditions often
invoked include that initial states are exogenous (the approach taken here for
simplicity) or that the process is in equilibrium. However, a common problem
is that the process under investigation is not studied from its beginning,
implying that the first state observed cannot be exogenous. Heckman (1981b)
suggests an ad hoc approach to approximate the initial conditions for
dichotomous dynamic models. See Hsiao (2002) for a discussion of this and other
approaches.

9.7 Counting snowshoe hares: Capture-recapture models with 
heterogeneity 

Capture-recapture studies are often used to ascertain the size of a population,
for example the number of heroin users in a city or the number of animals
of a given species in some geographical area. The idea is to ‘capture’
individuals from the population on different occasions or using different methods
and record the identity of the captured individuals (for animal populations
this requires marking the animals). The capture histories of those individuals
captured at least once can then be used to estimate the number of individuals
never captured and hence the total population size. A basic assumption here
is that the population is constant or ‘closed’ throughout the study.

Consider the simple example of two captures. If all individuals have the
same chance of being captured and the probabilities of being captured on the
two occasions are independent, then a large proportion of individuals captured
on both occasions indicates a small population. However, an alternative
explanation for a large number of recaptures is that some individuals are much
easier to catch than others and it is these individuals who were captured
twice. With only two captures, we cannot distinguish between these two
explanations empirically and independence is usually assumed. With more than
two captures, we can use the observed capture histories to estimate the degree
of unobserved heterogeneity in catchability. Failing to account for unobserved
heterogeneity can lead to biased estimates of population size (e.g. Otis et
al., 1978), although bias can to some extent be mitigated by design or by
including observed covariates in the models.

Burnham and Cushwa (see Otis et al., 1978) laid out a livetrapping grid in
a black spruce forest in Alaska to estimate the (closed) population of snowshoe
hares. The basic grid was 10 × 10 with traps spaced 200 feet apart. Trapping
was carried out for 9 consecutive days in early winter but traps were not baited
for the first three days. Data were obtained on 68 captures and recaptures from
the last 6 days of trapping. Table 9.12 shows the number of hares experiencing
each of the possible capture histories, represented by indicators for capture
(1 = capture, 0 = no capture) at each of the six occasions⁷. The count for the
cell corresponding to no captures is unknown and our aim is to estimate this
number.


Table 9.12 Results of capture-recapture of snowshoe hares

[Table body not recoverable from the source. Layout: counts cross-classified by capture history on occasions 6, 5, 4 and capture history on occasions 3, 2, 1; the cell for no captures on any occasion is unobserved.]

Source: Agresti (1994)


A common approach to estimating the population size is Sanathanan’s
(1972) conditional method. First a model is specified for the probabilities
$\pi_{\mathbf{y}}$ of the capture histories $\mathbf{y}$, where $\mathbf{y}$ denotes a sequence of indicators (0 or
1) for capture on each of the occasions. The parameters of this model are
estimated by maximizing the conditional likelihood of the observable capture
histories given that the individuals were captured at least once. The
conditional probability of capture history $\mathbf{y}$ given that the individual was captured
at least once is

\[
\pi_{c\mathbf{y}} = \pi_{\mathbf{y}}/(1 - \pi_{0\ldots0}),
\]

where $\pi_{0\ldots0}$ is the probability of never getting caught. The conditional
log-likelihood therefore is

\[
\sum_{\mathbf{y}} n_{\mathbf{y}} \ln \pi_{c\mathbf{y}} = \sum_{\mathbf{y}} n_{\mathbf{y}} \ln \pi_{\mathbf{y}} - \sum_{\mathbf{y}} n_{\mathbf{y}} \ln(1 - \pi_{0\ldots0}),
\]

where the sum is over the observable capture histories, i.e. all possible histories
excluding $0\ldots0$, and $n_{\mathbf{y}}$ is the number of individuals with history $\mathbf{y}$. The total
population size is then estimated by maximizing the binomial probability of
capturing $n = \sum_{\mathbf{y}} n_{\mathbf{y}}$ individuals at least once where the probability of success
is $(1 - \hat\pi_{0\ldots0})$, giving

\[
\hat{N} = \frac{n}{1 - \hat\pi_{0\ldots0}}.
\]

7 The data can be downloaded from gllamm.org/books

Unobserved heterogeneity in catchability is likely in the capture-recapture
study of snowshoe hares for several reasons. We would generally expect
heterogeneity due to behavioral differences such as trap-attraction or trap-avoidance.
Moreover, hares with larger foraging areas are also exposed to more traps than
those with smaller areas and hares near grid boundaries are less prone to
capture.

Coull and Agresti (1999) use a random intercept logistic model to account
for unobserved heterogeneity. The conditional probability of capture of
individual $j$ on occasion $i$ is modeled as

\[
\mathrm{logit}[\Pr(y_{ij} = 1 \mid \zeta_j)] = \beta_i + \zeta_j,
\]

where $\zeta_j$, representing the ‘catchability’ of animal $j$, is normally distributed
with mean 0 and variance $\psi$. This is the one-parameter logistic item response
model discussed in Section 3.3.4. The probability of a given history $\pi_{\mathbf{y}}$ with
$\mathbf{y} = (y_1, \ldots, y_I)'$ then is

\[
\pi_{\mathbf{y}} = \int \prod_{i=1}^{I} \frac{\exp(y_i(\beta_i + \sqrt{\psi}\,z))}{1 + \exp(\beta_i + \sqrt{\psi}\,z)}\, \phi(z)\, dz,
\]

where $\phi(\cdot)$ is the standard normal density.
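The integral for $\pi_{\mathbf{y}}$ has no closed form but is one-dimensional and easy to approximate. The sketch below uses a plain trapezoidal rule over $z$ for transparency; this is our simplification and not the adaptive quadrature used for estimation elsewhere in the book. Function names and the grid settings are assumptions, and the parameter values are hypothetical.

```python
import math
from itertools import product


def history_probability(y, beta, psi, n_grid=801, z_max=8.0):
    """Marginal probability of capture history y under the random intercept
    model, integrating over z ~ N(0, 1) with the trapezoidal rule."""
    h = 2.0 * z_max / (n_grid - 1)
    sd = math.sqrt(psi)
    total = 0.0
    for k in range(n_grid):
        z = -z_max + k * h
        cond = 1.0  # conditional probability of the history given z
        for y_i, b_i in zip(y, beta):
            p = 1.0 / (1.0 + math.exp(-(b_i + sd * z)))
            cond *= p if y_i else 1.0 - p
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * cond * density * h
    return total


# Hypothetical parameters for three capture occasions
beta, psi = [-1.5, -1.0, -0.5], 1.0
probs = {y: history_probability(y, beta, psi)
         for y in product((0, 1), repeat=len(beta))}
print("Pr(never captured) =", round(probs[(0, 0, 0)], 4))
print("sum over all histories =", round(sum(probs.values()), 6))
```

Summing over all $2^I$ histories should return one, which is a convenient check on the numerical integration.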


Instead of assuming a normal distribution for the random intercept, we can
assume that the population consists of latent classes with different constant
levels of catchability, i.e. we can allow $\zeta_j$ to be discrete so that

\[
\pi_{\mathbf{y}} = \sum_{c} \pi_c \prod_{i=1}^{I} \frac{\exp(y_i(\beta_i + e_c))}{1 + \exp(\beta_i + e_c)},
\]

where $e_c$ is the catchability of latent class $c$ and $\pi_c$ is the probability of
belonging to class $c$. For identification, we restrict the mean of $\zeta_j$ to be zero,

\[
E(\zeta_j) = \sum_c \pi_c e_c = 0,
\]

so that the variance of $\zeta_j$ is

\[
\mathrm{Var}(\zeta_j) = \sum_c \pi_c e_c^2.
\]
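For the latent class model the probability of never being caught is a finite sum, so the population size estimate $\hat{N} = n/(1 - \hat\pi_{0\ldots0})$ can be computed directly from the estimates. The sketch below plugs in the two-class estimates reported for the snowshoe hare data in Table 9.13 (rounded as printed there) together with $n = 68$ captured hares; the function name is ours.

```python
import math


def prob_never_captured(beta, locations, masses):
    """pi_{0...0} for the latent class model: a mixture over classes of the
    product over occasions of the non-capture probabilities."""
    total = 0.0
    for e_c, pi_c in zip(locations, masses):
        p_none = 1.0
        for b_i in beta:
            p_none *= 1.0 / (1.0 + math.exp(b_i + e_c))
        total += pi_c * p_none
    return total


# Two-class estimates as printed in Table 9.13 and the accompanying text
beta = [-1.36, -0.51, -1.03, -0.64, -0.83, -0.29]
locations, masses = [4.14, -0.13], [0.03, 0.97]

pi0 = prob_never_captured(beta, locations, masses)
n_hat = 68 / (1.0 - pi0)
variance = sum(p * e ** 2 for p, e in zip(masses, locations))

print(f"Pr(never captured) = {pi0:.3f}, estimated N = {n_hat:.1f}")
print(f"Var(zeta) from rounded estimates = {variance:.2f}")
```

The resulting population size is close to the 77.1 given in Table 9.13. The variance computed from the rounded class probabilities differs slightly from the tabulated value because of rounding.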

The estimates for the homogenous population model, the model with a 
normal random intercept and the two-class model are shown in Table 9.13. 
The two-class solution has Si = 4.14, = —0.13, 7fi = 0.03 and 7?2 = 0.97. 

Increasing the number of classes to three only results in a small increase of 0.30
in the conditional log-likelihood. Note that the estimated population sizes are
quite different for the two approaches to including unobserved heterogeneity
(92 and 77), the latent class estimate being close to the conventional estimate,
although the fit of the heterogeneity models is similar.

The models assume that the dependence among the responses is purely due
to unobserved heterogeneity. However, it is also possible that capture on one
occasion directly affects the probability of capture on subsequent occasions
(state dependence), particularly if the same method of trapping is used.
Huggins (1989) therefore included a time-varying indicator of previous capture $x_{ij}$
([Prev]) in the random intercept model, where $x_{ij} = 1$ if the animal has been
captured before and $x_{ij} = 0$ otherwise,

$$\mathrm{logit}[\Pr(y_{ij} = 1 \mid \zeta_j, x_{ij})] = \beta_i + \zeta_j + \gamma x_{ij}.$$

For the snowshoe hare data and a normally distributed random intercept
$\zeta_j$, the estimates are shown in the last column of Table 9.13. Here $\hat{\gamma} = -1.10$,
indicating that animals are less likely to be caught again if they have previously
been caught. The effect is not quite significant at the 5% level (p=0.07). The
estimated population size is now 75, equal to the conventional estimate.

Cormack (1992) suggests constructing confidence intervals for the true
population size $N$ using the (unconditional) profile likelihood for $N$. The approach
is to substitute different values for $n_{0\ldots0}$, the number of animals never captured,
estimate the model parameters by maximizing the unconditional likelihood and
evaluate the deviance of the model. The confidence limits are then the values of
$N = n + n_{0\ldots0}$ that yield a deviance differing from the minimum by a prespecified
value, 3.84 in the case of a 95% confidence interval. It is important to use the
deviance rather than the log-likelihood itself since the log-likelihood of the
saturated model changes with $n_{0\ldots0}$. Cormack shows that the parameter estimates
and deviance for the conditional likelihood are identical to those for the
unconditional likelihood when $n_{0\ldots0}$ is equal to the conditional estimate.


Table 9.13 Estimates for capture-recapture of snowshoe hares

                     Homogen.      Random         Two            Random int.
                     population    intercept      classes        & prev. hist.
Parameter            Est (SE)      Est (SE)       Est (SE)       Est (SE)

$\beta_1$ [Cap1]     ...           -1.83 (0.47)   -1.36 (0.33)   ...
$\beta_2$ [Cap2]     ...           -0.98 (0.43)   -0.51 (0.28)   ...
$\beta_3$ [Cap3]     ...           -1.51 (0.45)   -1.03 (0.30)   ...
$\beta_4$ [Cap4]     ...           -1.11 (0.44)   -0.64 (0.29)   ...
$\beta_5$ [Cap5]     ...           -1.30 (0.44)   -0.83 (0.29)   ...
$\beta_6$ [Cap6]     ...           -0.74 (0.43)   -0.29 (0.28)   ...
$\gamma$ [Prev]                                                  -1.10 (0.60)
Var($\zeta_j$)                     0.93 (0.63)    0.56           ...
$L_c$                ...           -250.74        -249.16        -249.41
$\hat{N}$            ...           92.1           77.1           75.0
95% CI               ...           (74,153)       (70,88)        (68,116)


Figure 9.9 Profile deviances for population size for four capture-recapture models

The profile deviances for the four models are shown in Figure 9.9. The 
horizontal lines represent the minimum deviance plus 3.84. The vertical lines 
indicate approximate 95% confidence limits for the population size: integer
values of $N$ with deviances as close as possible to, and no less than, the value
indicated by the horizontal line. The confidence intervals, given in Table 9.13,
are quite wide, particularly for the random intercept model.

Summarizing the findings, we can make the conservative statement that the 
population size lies somewhere between 68 and 153. 
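To illustrate the profile likelihood idea, the following minimal sketch treats only the homogeneous population model with occasion-specific capture probabilities. It is not Cormack's log-linear formulation: it works directly with the multinomial likelihood, in which the term depending on $N$ is retained, and it reports the integer values of $N$ whose profile deviance lies within 3.84 of the minimum. The capture counts are hypothetical and are not the snowshoe hare data.

```python
import math


def profile_loglik(N, n_distinct, captures):
    """Profile log-likelihood of N (up to a constant) for the homogeneous model.

    For fixed N, the capture probability at occasion i is estimated by n_i / N.
    """
    ll = math.lgamma(N + 1) - math.lgamma(N - n_distinct + 1)
    for n_i in captures:
        p = n_i / N
        ll += n_i * math.log(p) + (N - n_i) * math.log(1.0 - p)
    return ll


def profile_interval(n_distinct, captures, cutoff=3.84, n_max=1000):
    """Return (N_hat, lower, upper) over integer N from n_distinct to n_max."""
    grid = range(n_distinct, n_max + 1)
    ll = {N: profile_loglik(N, n_distinct, captures) for N in grid}
    n_hat = max(ll, key=ll.get)
    inside = [N for N in grid if 2.0 * (ll[n_hat] - ll[N]) <= cutoff]
    return n_hat, min(inside), max(inside)


# Hypothetical data: 68 distinct animals, captures per occasion over 6 occasions.
n_hat, lower, upper = profile_interval(68, [16, 28, 20, 25, 23, 32])
```

The lower limit can never fall below the number of distinct animals observed, and the interval is typically asymmetric around the estimate, which matches the shape of the intervals in Table 9.13.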


9.8 Attitudes to abortion: A multilevel item response model 

In the British Social Attitudes Survey Panel 1983-1986 (Social and Community
Planning Research, 1987),⁸ respondents were asked whether or not
abortion should be allowed by law under the following circumstances:

1. [Woman] the woman decides on her own she does not wish to have the child

2. [Couple] the couple agree that they do not wish to have the child

3. [Marriage] the woman is not married and does not wish to marry the man

4. [Financial] the couple cannot afford any more children

5. [Defect] there is a strong chance of a defect in the baby

6. [Risk] the woman's health is seriously endangered by the pregnancy

7. [Rape] the woman became pregnant as a result of rape

⁸ Data were supplied by the UK Data Archive. Neither the original data collectors nor the
archive are responsible for the present analyses.

The data have a three-level structure with occasions or 'panel waves' nested
in individuals nested in polling districts. There were 14143 responses to the 
7 items over the four panel waves from 734 individuals in 57 polling districts. 
The multilevel design is highly unbalanced with 49% of subjects responding 
to at least one item in all four panel waves, 12% in three waves, 13% in two 
waves and 25% in one wave. Unit nonresponse was therefore common, but if 
an interview took place, item nonresponse occurred in only 7% of cases. We 
will not explicitly model unit or item nonresponse and therefore assume that 
the data are missing at random (MAR). 

Previous multilevel analyses of these data have used raw sumscores or
scores constructed from item response models as response variable (Knott
et al., 1990; Wiggins et al., 1990). However, using such constructed scores as
proxies for latent variables has been demonstrated to be highly problematic,
leading to biased standard errors and often to inconsistent parameter estimates
(Skrondal and Laake, 2001). Hence, we use multilevel factor models
with a logit link for the dichotomous items (Rabe-Hesketh et al., 2004a). The
change in deviance is used to choose between competing models. Each model 
is fitted a number of times using adaptive quadrature comparing solutions 
with different numbers of quadrature points per dimension to ensure reliable 
results. 

Initially, we focus on between-subject heterogeneity and subsequently also
include heterogeneity between polling districts. It is plausible that, in addition
to a 'general attitude' factor $\eta^{(3)}_{Gjk}$ measured by all items, there may be an
independent 'extreme circumstance' factor $\eta^{(3)}_{Ejk}$ representing people's additional
inclination to be in favor of abortion when there is a strong chance of a defect
in the baby, a high risk to the woman, or where the pregnancy was a result
of rape (items 5, 6 and 7). Using indices $i$ for item or circumstance (level 1),
$t$ for occasion (level 2), $j$ for subject (level 3) and $k$ for polling district (level
4), the two-factor model can be written in GRC notation as

$$\nu_{itjk} = \mathbf{d}_i'\boldsymbol{\beta} + \eta^{(3)}_{Gjk}\,\mathbf{d}_i'\boldsymbol{\lambda}_G + \eta^{(3)}_{Ejk}\,\mathbf{d}_{Ei}'\boldsymbol{\lambda}_E, \qquad (9.5)$$

where $\mathbf{d}_i$ is a 7-dimensional vector with $i$th element equal to 1 and all other
elements equal to 0, and $\mathbf{d}_{Ei}$ is a 3-dimensional vector of indicators for items
5, 6 and 7, equal to the last three elements of $\mathbf{d}_i$; for example $\mathbf{d}_{E6} = (0,1,0)'$.
A unidimensional factor model appears to be inadequate since removing the
extreme circumstance factor increases the deviance by 207.7 with 3 degrees of
freedom.
Since there are repeated responses for each subject and item, item specific
unique factors can be included at the subject level:

$$\nu_{itjk} = \mathbf{d}_i'\boldsymbol{\beta} + \eta^{(3)}_{Gjk}\,\mathbf{d}_i'\boldsymbol{\lambda}_G + \eta^{(3)}_{Ejk}\,\mathbf{d}_{Ei}'\boldsymbol{\lambda}_E + \sum_{m=1}^{7} \eta^{(3)}_{Umjk}\, d_{im}, \qquad (9.6)$$

where the latent variables are mutually independent. The unique factors $\eta^{(3)}_{Umjk}$
in the last term can be interpreted as heterogeneity between subjects in their
attitudes to specific items which induces additional dependence between
responses over time not accounted for by the common factors. Evaluation of the
log-likelihood for this model requires integration over 9 dimensions at level 3.
To reduce the dimensionality, the items $i$ can be treated as level-2 units so
that time becomes level 1 and the model is reparameterized as

$$\nu_{tijk} = \mathbf{d}_i'\boldsymbol{\beta} + \eta^{(3)}_{Gjk}\,\mathbf{d}_i'\boldsymbol{\lambda}_G + \eta^{(3)}_{Ejk}\,\mathbf{d}_{Ei}'\boldsymbol{\lambda}_E + \eta^{(2)}_{Uijk}\,\mathbf{d}_i'\boldsymbol{\lambda}_U. \qquad (9.7)$$

Here the last term in (9.6), which evaluates to $\eta^{(3)}_{Uijk}$ for item $i$, has been replaced
by $\eta^{(2)}_{Uijk}\lambda_{Ui}$. Whereas the $\eta^{(3)}_{Umjk}$ are treated as separate latent variables for the
items, $m = 1, \ldots, 7$, $\eta^{(2)}_{Uijk}$ is a single latent variable with different realizations
for different items $i$. The purpose of $\lambda_{Ui}$ is to allow the unique factor variances
to differ between the items. The models are equivalent since both $\eta^{(3)}_{Uijk}$ and
$\eta^{(2)}_{Uijk}\lambda_{Ui}$ vary between items, are uncorrelated across items and have
item-specific variances. The advantage of (9.7) is that a nine-dimensional integral
at level 3 has been replaced by a one-dimensional integral at level 2 and a
two-dimensional integral at level 3. It is often possible to reduce the
dimensionality of integration by reparameterization to an equivalent model (see also
Section 5.3.2). Adding unique factors at the subject level to the two-factor
model decreases the deviance by 12.6, a small change for seven additional
parameters.
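The two deviance changes quoted so far can be put on a common scale by referring them to a chi-square distribution. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it uses only the Python standard library, the helper `chi2_sf` is written here for the purpose, and it applies the naive chi-square reference distribution. When the null hypothesis sets variances to zero, the parameter lies on the boundary of the parameter space and the naive p-value tends to be conservative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd df: erfc term plus a finite series in x^(j - 1/2) / (1 * 3 * ... * (2j - 1))
    total, term = 0.0, math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 / math.pi) * math.exp(-half) * total)


# Deviance changes and degrees of freedom quoted in the text.
p_extreme = chi2_sf(207.7, 3)   # removing the extreme circumstance factor
p_unique = chi2_sf(12.6, 7)     # adding seven subject-level unique factors
```

Under these assumptions `p_extreme` is vanishingly small, whereas `p_unique` falls between 0.05 and 0.10, in line with the description of 12.6 as a small change for seven parameters.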

Introducing district-level latent variables in addition to subject level latent 
variables, the common factors can be allowed to vary between polling districts 
giving two-dimensional variance components factor models (see Section 4.3). 
Allowing the general attitude factor to vary between districts decreases the 
deviance by 8.2 with one extra parameter whereas the deviance decreases by 
only 3.2 for the extreme circumstance factor. The retained model is therefore 
the response model in (9.5) plus the structural model 

A path diagram for the retained model is given in Figure 9.10. It should
be noted that the paths to the responses do not represent linear effects on
the responses; the paths from $\eta_G$ and $\eta_E$ represent linear effects on the log
odds, whereas the short arrows represent random (Bernoulli) variability of the
responses given the model-implied probabilities.

Including unique factors at the district level increases the dimension of
integration at level 4 from 1 to 8. The dimensionality cannot be reduced by
reparameterization in this case. We therefore included each unique factor
separately, but the changes in deviance were small. Estimates for the retained
multilevel variance components logit factor model are given in Table 9.14.
These were obtained using adaptive quadrature with 10 points per dimension
which gave very similar results to 8 and 5 points per dimension. As expected,
the intercepts for the extreme circumstance items were much larger than for
the others due to the higher prevalence of endorsing these items. A general
attitude and an extreme circumstance factor were required at the subject level.
Only the general attitude factor appeared to vary at the polling district level,
but with a relatively small standard deviation.


Figure 9.10 Path diagram for the multilevel variance components factor model


Table 9.14 Estimates for the multilevel variance components logit factor model

Fixed part
  Intercepts:                                  Est (SE)
  $\beta_1$ [Woman]                            -0.83 (0.14)
  $\beta_2$ [Couple]                           -0.17 (0.15)
  $\beta_3$ [Marriage]                         -0.28 (0.16)
  $\beta_4$ [Financial]                        -0.01 (0.14)
  $\beta_5$ [Defect]                            3.79 (0.27)
  $\beta_6$ [Risk]                              5.90 (0.56)
  $\beta_7$ [Rape]                              4.82 (0.39)

Random part: Subject level
  Factor loadings                              General        Extreme
  $\lambda_{G1}$ & $\lambda_{E1}$ [Woman]      1              0
  $\lambda_{G2}$ & $\lambda_{E2}$ [Couple]     1.13 (0.08)    0
  $\lambda_{G3}$ & $\lambda_{E3}$ [Marriage]   1.21 (0.09)    0
  $\lambda_{G4}$ & $\lambda_{E4}$ [Financial]  1.01 (0.08)    0
  $\lambda_{G5}$ & $\lambda_{E5}$ [Defect]     0.78 (0.09)    1
  $\lambda_{G6}$ & $\lambda_{E6}$ [Risk]       0.73 (0.13)    1.53 (0.26)
  $\lambda_{G7}$ & $\lambda_{E7}$ [Rape]       0.72 (0.11)    1.23 (0.21)
  Factor variances
  $\psi_G$ & $\psi_E$                          5.22 (0.67)    3.30 (0.80)

Random part: District level
  Factor variances
  $\psi_G$ & $\psi_E$                          0.36 (0.17)    0

Log-likelihood                                 -5160.9

Source: Rabe-Hesketh et al. (2004a)


9.9 Summary and further reading

We first considered longitudinal data on respiratory infection. We used a
random intercept logistic regression model and compared the results to GEE.
Useful reviews of analysis of clustered binary data include Neuhaus (1992) and
Pendergast et al. (1996). Discussions of the pros and cons of conditional
versus marginal approaches are provided in Lindsey and Lambert (1998), Lindsey
(1999) and Crouchley and Davies (1999). A two-level random intercept model
was used to model change in condom use after HIV diagnosis by Skrondal et
al. (2000) and a three-level random intercept model for repeated
neuropsychological measures in schizophrenics, their healthy relatives and unrelated controls
by Rabe-Hesketh et al. (2001c). A two-level random coefficient model for
longitudinal data on thought disorder has been considered by Diggle et al. (2002),
Skrondal and Rabe-Hesketh (2003c) and Rabe-Hesketh and Everitt (2003),
among others.

The next application was a latent class model for the diagnosis of my¬ 
ocardial infarction. Other medical applications of such models are given in 
Formann and Kohlmann (1996). See also Section 13.5 for latent class models 
for rankings, Section 13.6 for first choices, Section 12.4.4 for durations and 
Sections 14.3 and 14.4 for multiple processes. A good overview of latent class 
modeling is given by Clogg (1995). 

We also considered one, two and three-parameter item response models with 
covariates for ability testing. Multidimensional versions of the two-parameter 
model for ordinal responses will be discussed in the next chapter and a mul¬ 
tilevel version was discussed in Section 9.8. 

A meta-analysis of the effectiveness of nicotine gum for smoking cessation
was conducted, using random effects models for patient-level data. Results
from Bayesian and likelihood methods were compared. Although extremely
popular in medicine, it should be noted that meta-analysis is also gaining
popularity in other disciplines such as economics (e.g. Granger, 2002) and
sociology (e.g. DiPrete, 2002). Useful books on meta-analysis include Hedges
and Olkin (1985) and Whitehead (2002), and useful reviews are given by Fleiss
(1993) and Normand (1999).

Wives' employment transition data were then analyzed using different types
of Markov models with random effects to explore the issue of true versus 
spurious state dependence. In this chapter the response has been treated as 
dichotomous, but could alternatively be viewed as a discrete time duration; 
see Chapter 12. 

Another application concerned the estimation of the population size of 
snowshoe hares using capture-recapture models with unobserved heterogene¬ 
ity. A useful review of these methods is given by Chao et al. (2001). 

Finally, we described a multilevel item response model for attitudes to abor¬ 
tion. Fox (2001) explores multilevel structural equation models for dichoto¬ 
mous responses in an educational setting. Ansari and Jedidi (2000) and Fox 
and Glas (2001) discuss Bayesian estimation of multilevel item response mod¬ 
els. 

All models considered in this chapter have used the logit link, but could 
also have been formulated in terms of probit or complementary log-log links. 
CHAPTER 10


Ordinal responses 


10.1 Introduction 

The first theme of this chapter is ‘growth curve’ models for analyzing the 
effect of a cluster randomized intervention on ordinal responses measured at 
several occasions. Initially, we discuss multilevel growth models for repeated 
measures of a particular ordinal observed response. These models are subse¬ 
quently extended to growth models for a latent variable that is repeatedly 
measured by several ordinal items at each occasion. 

The other theme of the chapter is 'psychometric validation' of measurement
instruments with ordinal items. In particular, we demonstrate how properties
such as factor dimensionality, item reliability and item bias can be
investigated.

10.2 Cluster randomized trial of sex education: Latent growth 
curve model 

10.2.1 Introduction 

A cluster randomized trial is one where clusters of units rather than the units 
themselves are randomized to treatment groups. A typical application is the 
evaluation of nontherapeutic interventions, for instance the effect of different 
modes of sex education on use of contraceptives. 

Cluster randomized trials have several merits: First, some treatments are 
most naturally applied at the cluster level. This is obviously the case with 
sex education which takes place in school classes, making randomization of 
individual students impractical. Second, cluster randomized trials reduce ex¬ 
perimental contamination. In the sex education example, contamination would 
occur if students receiving the intervention would share their knowledge with 
students not receiving the intervention. Such contamination can be minimized 
by randomizing at the school level, assuming that there is little communication 
among students from different schools. 

There are, however, some disadvantages of cluster randomized trials: First, 
units in a cluster are often more similar than units in different clusters. This 
implies that units cannot be treated as independent in statistical modeling; 
the dependence among units within clusters must be accounted for. Second, 
and related to the first issue, cluster randomized trials are usually less efficient 
than classical randomized trials. 
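The loss of efficiency can be quantified with the standard design effect for equal cluster sizes, $1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ the intraclass correlation. This formula is not given in the text and is added here only as an illustration; the numbers below are hypothetical and do not come from the sex education trial.

```python
def design_effect(cluster_size, icc):
    """Variance inflation of a mean under cluster sampling with equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent units carrying the same information as n_total clustered units."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical trial: 40 clusters of 25 units, intraclass correlation 0.05.
deff = design_effect(25, 0.05)
n_eff = effective_sample_size(40 * 25, 25, 0.05)
```

Even a modest intraclass correlation roughly halves the effective sample size in this hypothetical setting, which is why the dependence within clusters cannot be ignored at the design or analysis stage.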

In this section we will analyze data from a cluster randomized trial of sex
education for 15 and 16 year olds in Norwegian schools (Traeen, 2003).¹ A
school book and curriculum for sex education was developed. This included
dramas created for students to perform, debates of specific questions and
practical tasks such as finding out how to get hold of contraception. The
intervention was designed to 'make adolescents competent actors in sexual
contexts in the sense that they dared handle contraception and put up limits'.

The outcome of interest, whether contraception was being used, was only
available on a minority of the adolescents who were sexually active. Instead
of actual behavior, the hypothetical construct 'contraceptive self-efficacy' was
hence studied. This construct has previously been shown to be a good
predictor of contraceptive use (e.g. Kvalem and Traeen, 2000). In this section we
focus on the constituent construct 'situational contraceptive communication',
measured by three questionnaire items:

"If my partner and I were about to have intercourse without either of us
having mentioned contraception ...

• [Tell] I could easily tell him/her that I didn't have any contraception"

• [Ask] I could easily ask him/her if he/she had any contraception"

• [Get] I could easily get out a condom (if I had one with me)"

The questions were answered in terms of five ordinal response categories: 

1. Not at all true of me 

2. Slightly true of me 

3. Somewhat true of me 

4. Mostly true of me 

5. Completely true of me 

Schools were randomized to receive the intervention or not. Questionnaires 
were completed prerandomization and 6 months and 18 months postrandom¬ 
ization. The data therefore have a three-level structure with occasions t nested 
in students j nested in schools k. 46 schools and 1184 students contributed 
to the analysis. Only 570 students always responded, 400 responded on some 
occasions and 114 never responded. The two predictors we will use here are 

• [Treat] dummy variable for student being in school receiving treatment, $x_{1jk}$
(yes=1, no=0)

• [Time] time since randomization in 6-month periods, $x_{2tjk}$ (0, 1, 3)

10.2.2 Growth curve modeling 

We will initially estimate a multilevel proportional odds model for one of the 
items, [Get]. We will allow the mean of the latent response to depend on [Time] 
(a linear trend), [Treat] and [Time] x [Treat] and include random intercepts 

1 We thank Bente Traeen for providing us with these data. 
latent responses of 0.37. There is a small treatment effect [Time] x [Treat], the
estimate 0.17 corresponding to an odds ratio of 1.19. Therefore the percentage 
increase in the odds of high versus low responses per six month period is 19% 
higher in the treatment group than the control group. (Here high versus low 
response can either mean response 5 versus 1 to 4, or responses 4 or 5 versus 
1 to 3 or responses 3 to 5 versus 1 or 2 or responses 2 to 5 versus 1.) 
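The statement that the same odds ratio applies to every way of splitting the response into 'high' and 'low' follows from the proportional odds structure. A minimal sketch, using hypothetical thresholds and a hypothetical control-group slope together with the interaction estimate 0.17 quoted above:

```python
import math


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def odds(p):
    return p / (1.0 - p)


def prob_at_or_above(linear_predictor, threshold):
    """Pr(y >= s) under a proportional odds model with logit link."""
    return expit(linear_predictor - threshold)


kappa = [-1.0, 0.0, 1.2, 2.5]   # hypothetical thresholds for s = 2, 3, 4, 5
slope_control = 0.05            # hypothetical effect of one period, control group
interaction = 0.17              # [Time] x [Treat] estimate quoted in the text

# Odds ratio per six-month period in each group, and their ratio, for every split.
ratios = []
for k in kappa:
    base = odds(prob_at_or_above(0.0, k))
    or_control = odds(prob_at_or_above(slope_control, k)) / base
    or_treated = odds(prob_at_or_above(slope_control + interaction, k)) / base
    ratios.append(or_treated / or_control)

odds_ratio = math.exp(interaction)
```

Every entry of `ratios` equals `exp(0.17)`, which rounds to 1.19, regardless of the threshold. Note that this is a conditional odds ratio, holding the random effects fixed.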

To visualize this treatment effect, Figure 10.1 shows the population averaged
probabilities $\Pr(y_{tjk} \ge s \mid \mathbf{x}_{tjk}; \hat{\boldsymbol{\theta}})$ of response $s$ or above, $s = 2, 3, 4, 5$, by time
and treatment group. These probabilities were obtained by integrating the
conditional response probabilities, given the random effects, over the random
effects distribution. The corresponding observed proportions are also shown.
It is worth pointing out that linear relationships on the logit scale do not
generally look this linear on the probability scale.
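The integration step can be sketched for a single normally distributed random intercept. This is an illustration only: the model in the text has random intercepts at more than one level, the values below are hypothetical, and a simple trapezoid rule stands in for the quadrature used by the authors.

```python
import math


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def marginal_prob(linear_predictor, threshold, sd, n_steps=4000, z_max=8.0):
    """Pr(y >= s) averaged over a N(0, sd^2) random intercept (trapezoid rule)."""
    h = 2.0 * z_max / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        z = -z_max + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * expit(linear_predictor + sd * z - threshold) * normal_pdf(z)
    return total * h


# Hypothetical values: linear predictor 0.3, threshold 1.0, random intercept sd 1.5.
conditional = expit(0.3 - 1.0)              # probability for a unit with random intercept 0
marginal = marginal_prob(0.3, 1.0, sd=1.5)  # population averaged probability
```

The population averaged probability is pulled towards 0.5 relative to the conditional probability at a random intercept of zero, which is the usual attenuation of marginal effects in logit models with random effects.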



Figure 10.1 Predicted and observed marginal response probabilities (separate
lines for intervention group and control group; • predicted, o observed)


10.2.3 Latent growth curve modeling 

Since there is no evidence for variability between schools, we will henceforth
omit the $k$ subscript. We first develop a measurement model for contraceptive
self-efficacy $\eta_{tj}$, measured by the three ordinal items $y_{itj}$; $i = 1$ [Tell], $i = 2$
[Ask] and $i = 3$ [Get]. One-factor models with three different specifications of
thresholds $\kappa_{is}$ and intercepts $\delta_i$ were considered:

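• Different thresholds for each item $i$ and no intercepts (12 parameters)

$$y_{itj} = \begin{cases} 1 & \text{if } y^*_{itj} \le \kappa_{i1} \\ 2 & \text{if } \kappa_{i1} < y^*_{itj} \le \kappa_{i2} \\ 3 & \text{if } \kappa_{i2} < y^*_{itj} \le \kappa_{i3} \\ 4 & \text{if } \kappa_{i3} < y^*_{itj} \le \kappa_{i4} \\ 5 & \text{if } \kappa_{i4} < y^*_{itj}. \end{cases}$$

• One set of thresholds for all three items and no intercepts (4 parameters)

$$y_{itj} = \begin{cases} 1 & \text{if } y^*_{itj} \le \kappa_{1} \\ 2 & \text{if } \kappa_{1} < y^*_{itj} \le \kappa_{2} \\ 3 & \text{if } \kappa_{2} < y^*_{itj} \le \kappa_{3} \\ 4 & \text{if } \kappa_{3} < y^*_{itj} \le \kappa_{4} \\ 5 & \text{if } \kappa_{4} < y^*_{itj}. \end{cases}$$

• One set of thresholds for all three items and intercepts $\delta_2$ and $\delta_3$ for items
2 and 3 (6 parameters).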

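The log-likelihoods are -6946, -6990 and -6950, respectively. The third model
fits almost as well as the first (likelihood ratio statistic 2(6950 - 6946) = 8 with
12 - 6 = 6 degrees of freedom) and much better than the second (2(6990 - 6950) = 80
with 6 - 4 = 2 degrees of freedom), so that the last model is retained. The latent
response $y^*_{itj}$ for the $i$th item at occasion $t$ for student $j$ is therefore modeled as

$$y^*_{itj} = \delta_i + \lambda_i \eta_{tj} + \epsilon_{itj}, \qquad \lambda_1 = 1, \quad \delta_1 = 0,$$

with constant thresholds $\kappa_s$, $s = 1, 2, 3, 4$.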
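The threshold model translates a latent response into one of the five categories, and with logistic errors the category probabilities are differences of cumulative probabilities. The sketch below assumes a logit link, as in the proportional odds models of this section; the thresholds, intercept and loading are hypothetical values, not the estimates in Table 10.2.

```python
import bisect
import math


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def category(latent_response, thresholds):
    """Observed category 1..S: one plus the number of thresholds strictly below y*."""
    return 1 + bisect.bisect_left(thresholds, latent_response)


def category_probs(eta, delta, lam, thresholds):
    """Pr(y = s | eta), s = 1..S, for y* = delta + lam * eta + logistic error."""
    nu = delta + lam * eta
    cum = [0.0] + [expit(k - nu) for k in thresholds] + [1.0]   # Pr(y <= s)
    return [cum[s + 1] - cum[s] for s in range(len(thresholds) + 1)]


kappa = [-2.0, -0.5, 0.8, 2.0]    # hypothetical common thresholds
probs_low = category_probs(eta=-1.0, delta=0.4, lam=1.1, thresholds=kappa)
probs_high = category_probs(eta=1.0, delta=0.4, lam=1.1, thresholds=kappa)
```

A positive intercept `delta` shifts probability mass towards the higher categories for a given value of the factor, which is the sense in which an item with a positive intercept is 'easier' to endorse.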

We then combine the measurement model with a structural model for
contraceptive self-efficacy

$$\eta_{tj} = \gamma_1 x_{1j} + \gamma_2 x_{2tj} + \gamma_3 x_{1j} x_{2tj} + \eta^{(3)}_j + \zeta^{(2)}_{tj}, \qquad \eta^{(3)}_j = \zeta^{(3)}_j,$$

where $\eta^{(3)}_j$ is a random intercept at the student level and $\zeta^{(2)}_{tj}$ an occasion
specific random intercept. A path diagram of this latent growth curve model
is shown in Figure 10.2 where the three latent variables $\eta_{1j}$, $\eta_{2j}$ and $\eta_{3j}$
represent $t = 1, 2, 3$, and analogously for $\zeta^{(2)}_{tj}$ and $x_{tj}$. We can alternatively
place $\eta_{tj}$ and $x_{tj}$ into a 'level-2' box and present the model as in Figure 10.3.

The parameter estimates for the latent growth curve model are given in
Table 10.2. The treatment effect is of a similar magnitude as before.
Surprisingly, there is a decline in contraceptive self-efficacy in the control group, but
efficacy increases in the treatment group as expected. There are large
variances both between students and between occasions within students. [Ask]
and [Get] appear to be 'easier' than [Tell] since the estimates of $\delta_2$ and $\delta_3$ are
positive. Although the factor loadings are quite close to 1, the low standard
errors suggest that they should not be constrained to 1.
Table 10.2 Estimates for latent growth curve model

Parameter                               Est (SE)

Structural model
  Regression coefficients
    $\gamma_1$ [Time]                   ...
    $\gamma_2$ [Treat]                  ...
    $\gamma_3$ [Time] x [Treat]         ...
  Variances
    occasion-level $\psi^{(2)}$         ...
    student-level $\psi^{(3)}$          ...

Measurement model
  Intercepts
    $\delta_1$ [Tell]                   ...
    $\delta_2$ [Ask]                    ...
    $\delta_3$ [Get]                    ...
  Factor loadings
    $\lambda_1$ [Tell]                  ...
    $\lambda_2$ [Ask]                   ...
    $\lambda_3$ [Get]                   ...
  Thresholds
    $\kappa_1$                          ...
    $\kappa_2$                          ...
    $\kappa_3$                          ...
    $\kappa_4$                          ...

Log-likelihood                          ...


Following Skrondal (1996), we will consider the 1719 respondents of the 1974
cross-section of the American subsample. For the present purposes, we have
included 'Don't know' responses as missing values (see also Rubin et al., 1995).
The univariate frequency distributions of the efficacy items are presented in
Table 10.3 and the frequency distribution of the number of items with missing
values is reported in Table 10.4. The 1710 respondents who responded to at
least one efficacy item are analyzed here under the missing at random (MAR)
assumption.
Table 10.3 Univariate frequency distributions of the political efficacy items

              4      3      2      1    Missing
[Nosay]     175    518    857    130      39
[Voting]    283    710    609     80      37
[Complex]   343    969    323     63      21
[Nocare]    250    701    674     57      37
[Touch]     273    881    462     26      77
[Interest]  264    762    581     31      81


Table 10.4 Frequency distribution of number of items with missing values 
Number of missing items 0 1 2 3 4 5 6 

Frequency 1554 106 26 18 4 2 9 


10.3.2 Factor dimensionality and reliability 

Let us first consider the factor dimensionality of the political efficacy items. 
For this purpose we will use an ordinal probit factor model of the form 

$$\mathbf{y}^*_j = \boldsymbol{\Lambda}\boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j.$$

We have omitted the constants so that we can identify all four thresholds
$\kappa_{1i}, \ldots, \kappa_{4i}$ for each item $i$. In the case of a unidimensional ordinal probit
factor model, we then obtain what is referred to as the graded response model 
(Samejima, 1969) in item response theory (IRT). 

The unidimensional factor model provides a formalization of the concept 
of unidimensionality which appears to concur with the ideas of applied scien¬ 
tists (McDonald, 1981). When it comes to multidimensionality the picture is 
less clear. For instance, it is possible to formulate a number of factor models 
which are consistent with different notions of bidimensionality. Here, we will 
explicate four kinds of factor bidimensionality, ordered in degree from strict 
to weak: 

1. Strict factor bidimensionality is formalized in terms of an independent
clusters factor model where the items only measure the dimensions they are
purported to measure. Thus, $\boldsymbol{\Lambda}$ is specified as block-diagonal. The factor
dimensions are moreover a priori specified as orthogonal; the covariance
matrix of the factors $\boldsymbol{\Psi}$ is diagonal. If this model is retained, the dimensional
validity of the items is maximal.

2. Strong factor bidimensionality is also formalized in terms of an independent
clusters factor model, but the factors are permitted to be correlated (see
top panel of Figure 10.4 for an example). Hence, this concept of strong
bidimensionality is somewhat weaker than strict factor bidimensionality.

3. Intermediate factor bidimensionality applies if one or more items, but not 
all, measure both dimensions. Such composite items are problematic, since 
they measure different phenomena. 

4. Weak factor bidimensionality corresponds to an unrestricted or exploratory 
factor model (Joreskog, 1969; Lawley and Maxwell, 1971). All items are 
permitted to reflect both dimensions of political efficacy (subject to iden¬ 
tification restrictions). Hence, this is the weakest possible formalization of 
bidimensionality in factor models. If this model is retained, the dimensional 
validity of the items is minimal. How to specify exploratory factor models 
as equivalent confirmatory factor models was discussed in Section 3.3.3; 
see the lower panel of Figure 10.4 for a parameterization of the exploratory 
two-factor model. 

There is a voluminous literature on the measurement properties of the po¬ 
litical efficacy items, and a number of alternative measurement models have 
been proposed. Some authors have argued for the unidimensionality of efficacy, 
whereas the predominant position clearly favors bidimensionality. Controversy 
reigns, however, when it comes to which items measure what dimension. 

Here, we will confine the discussion to the so-called NES (National Election
Studies) model (Miller et al., 1980). Miller et al. (1980, p. 253) present
the following interpretation of the two dimensions of political efficacy: one
dimension is interpreted as "individuals' self-perceptions that they are capable
of understanding politics and competent enough to participate in political
acts such as voting", and the other as "individuals' beliefs about political
institutions rather than perceptions about their own abilities". The first
dimension is dubbed 'internal efficacy' (or personal political competence) and
the second dimension is called 'external efficacy' (or political system
responsiveness). Miller et al. suggest that [Nosay], [Voting] and [Complex] measure
internal efficacy, whereas [Nocare], [Touch] and [Interest] measure external
efficacy. Hence, it appears to be reasonable to interpret the NES model as a
model with strong bidimensionality (see top panel of Figure 10.4).

At this point, it is important to point out that determination of the
dimensionality of factor models is often treated in a somewhat superficial manner
in the literature. Caution should be exercised for a number of reasons. First,
a problem of examining absolute fit is the multiple sources of discrepancy
between model and data. Lack of fit may not exclusively be due to
misspecification of the dimensionality, but may reflect any misspecification of the factor
model. This problem is compounded with categorical data, where scalar
specifications of thresholds will influence the absolute fit. Second, as was pointed
out in Section 8.5.2, there is a multiplicity of possibly contradictory goodness
of fit criteria. Third, the equivalence problem must be faced. This can be
illustrated from the literature on political efficacy. Craig and Maggiotto (1982)
specified a two-factor model where [Nosay], [Nocare], [Touch] and [Interest]
measure one dimension, whereas [Voting] and [Complex] measure the other
dimension. In contrast, Mason et al. (1985) specify a model with three
factors where the second factor of Craig and Maggiotto is split into two factors,
measured by [Voting] and [Complex] respectively. Both models are identified,
notwithstanding a somewhat contrived identification of the second model,
and equivalent. It follows that no empirical arguments can be used in
choosing between these competing two and three-dimensional conceptualizations of
political efficacy (see also Section 5.1).

Keeping the above problems in mind, we now proceed to compare the NES
model M2 with the weak dimensionality model M1. These models are
depicted in Figure 10.4. We note that M2 is nested in M1 since it results from
setting the factor loadings $\lambda_{41}$, $\lambda_{51}$, $\lambda_{22}$ and $\lambda_{32}$ of the latter model to zero.
Consequently, likelihood-ratio tests can be used to compare the fit of the two
competing models. The maximum likelihood estimators for the different
models are now obtained under the specification of bivariate normal latent traits,
utilizing all available data. The estimated factor loadings and factor variances
and covariances are displayed in Table 10.5 and the thresholds in Table 10.6.


Table 10.5 Estimated factor loadings and (co)variances

                              M2                             M1
                              Internal      External         Internal       External
Factor loadings
$\lambda_{1k}$ [Nosay]        1             0                1              0
$\lambda_{2k}$ [Voting]       0.52 (0.05)   0                0.69 (0.10)    -0.18 (0.10)
$\lambda_{3k}$ [Complex]      0.77 (0.07)   0                0.56 (0.08)    0.15 (0.08)
$\lambda_{4k}$ [Nocare]       0             1                0.72 (0.12)    1
$\lambda_{5k}$ [Touch]        0             0.74 (0.05)      -0.09 (0.17)   1.41 (0.26)
$\lambda_{6k}$ [Interest]     0             0.86 (0.06)      0              1.53 (0.16)

Factor (co)variances
$\psi_{11}$ & $\psi_{22}$     0.81 (0.09)   2.67 (0.31)      0.91 (0.13)    1.02 (0.21)
$\psi_{12}$                   1.24 (0.10)                    0.73 (0.10)

Log-likelihood                -9950.43                       -9924.04

The two dimensions of political efficacy appear to be rather highly
correlated (0.84 for M2 and 0.76 for M1), which is also reflected in the empirical
Bayes or factor score plot in Figure 10.5. Comparing M2 with M1, the
lihood ratio statistic is 52.79 with 4 degrees of freedom. We see that strong 
bidimensionality is clearly implausible for the political efficacy items, and weak 
bidimensionality must be retained. Thus, the dimensional validity of the effi¬ 
cacy items is low. We note in passing that different models with intermediate 
bidimensionality have been proposed (e.g. Aish and Joreskog, 1989), but such 
models will not be pursued here. 
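The correlations and the likelihood ratio statistic quoted above can be checked from the entries of Table 10.5. The sketch below uses the rounded estimates as read from that table, so the likelihood ratio statistic comes out as 52.78 rather than the reported 52.79, which was presumably computed from unrounded log-likelihoods. The p-value uses the closed form of the chi-square upper tail that holds for 4 degrees of freedom.

```python
import math

# Factor (co)variances and log-likelihoods as read from Table 10.5
m2 = {"psi11": 0.81, "psi22": 2.67, "psi12": 1.24, "loglik": -9950.43}
m1 = {"psi11": 0.91, "psi22": 1.02, "psi12": 0.73, "loglik": -9924.04}


def factor_correlation(model):
    """Correlation between the two factors implied by their (co)variances."""
    return model["psi12"] / math.sqrt(model["psi11"] * model["psi22"])


corr_m2 = factor_correlation(m2)
corr_m1 = factor_correlation(m1)

# Likelihood ratio statistic for M2 (four loadings set to zero) against M1.
lr = 2.0 * (m1["loglik"] - m2["loglik"])
half = lr / 2.0
p_value = math.exp(-half) * (1.0 + half)   # chi-square upper tail, valid for df = 4
```

The correlations round to 0.84 and 0.76, matching the text, and the p-value is far below any conventional significance level, which supports rejecting strong bidimensionality in favor of the weak bidimensionality model.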

Consider next the reliabilities of the different items under the retained 
Table 10.6 Estimated thresholds for models M2 and M1

                 $\kappa_{i1}$    $\kappa_{i2}$    $\kappa_{i3}$
M2
  [Nosay]        -1.70 (0.06)     -0.28 (0.04)     1.89 (0.07)
  [Voting]       -1.06 (0.04)     0.25 (0.04)      1.84 (0.06)
  [Complex]      -1.01 (0.05)     0.91 (0.04)      2.13 (0.07)
  [Nocare]       -2.00 (0.10)     0.35 (0.06)      3.36 (0.15)
  [Touch]        -1.53 (0.07)     0.86 (0.06)      3.27 (0.12)
  [Interest]     -1.69 (0.07)     0.58 (0.06)      3.50 (0.15)
M1
  [Nosay]        -1.74 (0.08)     -0.29 (0.05)     1.94 (0.08)
  [Voting]       -1.08 (0.04)     0.26 (0.04)      1.90 (0.07)
  [Complex]      -0.99 (0.05)     0.90 (0.04)      2.09 (0.07)
  [Nocare]       -1.96 (0.10)     0.35 (0.06)      3.30 (0.15)
  [Touch]        -1.63 (0.09)     0.91 (0.07)      3.48 (0.16)
  [Interest]     -1.80 (0.10)     0.62 (0.06)      3.72 (0.20)

Figure 10.5 Empirical Bayes factor scores of political efficacy (M1); axis
label: internal efficacy
of item i is assumed to be zero. The estimated lower bounds of the reliabilities
for M1 are 0.48, 0.23 and 0.30 for the internal efficacy items [Nosay], [Voting]
and [Complex], respectively, and 0.72, 0.65 and 0.71 for the external efficacy
items [Nocare], [Touch] and [Interest]. The lower bounds are rather small for
the internal efficacy items.


10.3.3 Item bias

Let us now consider the item bias or differential item functioning (DIF) of the
political efficacy items. In item response theory (IRT) an item is ‘biased’ if
the response to the item is dependent on extraneous information apart from
the factors. (See also Section 9.4, page 297 for an investigation of item-bias
in a dichotomous item response model.) Since item-bias is closely related to
the ‘fairness’ of tests, it comes as no surprise that claims of racial and ethnic
bias have led to a heated public debate and even lawsuits. Our analysis of
item-bias is based on the validation approach introduced by Muthén (1985,
1988b, 1989d). We find this methodology more direct and elegant than the
standard approaches in IRT, surveyed by e.g. Hambleton and Swaminathan
(1985) and Hambleton et al. (1991).

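We specify the model

$$\mathbf{y}^*_j = \mathbf{X}_j\boldsymbol{\beta} + \boldsymbol{\Lambda}\boldsymbol{\eta}_j + \boldsymbol{\epsilon}_j,$$

$$\boldsymbol{\eta}_j = \boldsymbol{\Gamma}\mathbf{w}_j + \boldsymbol{\zeta}_j.$$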

First consider a model without item-bias, $\boldsymbol{\beta} = \mathbf{0}$, a MIMIC model where
the efficacy factors are regressed on covariates. It follows from this
specification that the expectations of the factors become heterogeneous, and the
response probabilities are no longer homogeneous. The items are nevertheless
not biased, since all heterogeneity is transmitted through the factors. From
our previous results on factor dimensionality, an unstructured factor model is
specified. Following the discussion in Abramson (1983) and Listhaug (1989),
we have selected two covariates:

• [Educ] standardized education in years

• [Black] dummy variable for being black

The resultant MIMIC model is denoted M4, where we have specified $\gamma_{22} = 0$
([Black] has no effect on external efficacy) based on a preliminary analysis.
Discarding three respondents with missing values on either [Black] or [Educ]
yields a sample size of 1707.

A model incorporating item bias is now specified as a generalized MIMIC
model where the covariates have direct regression effects on some items, in
addition to the indirect effects via the factors. In terms of the model
parameters, this means that not all elements of $\boldsymbol{\beta}$ are zero. This model is
called M3. Note that M3 is not given a priori, but is on the contrary to be
suggested by exploratory analysis of our data. Performing a simple
cross-validation (see Section 8.5.4), our sample has been randomly divided into
an exploration sample of size 840 and a confirmation sample of size 867. We are
then free to delve in our exploration sample, and can subsequently test the
competing models on the confirmation sample. Exploration suggests that [Educ]
and [Black] have direct effects on [Voting] ($\beta_1$ and $\beta_2$) and [Educ] has a
direct effect on [Complex] ($\beta_3$). Path diagrams of models M3 and M4 are shown
in Figure 10.6.
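
The random division into exploration and confirmation samples can be sketched as follows. This is only an illustration of the idea: the function name, the use of respondent indices and the seed are our assumptions, and the split actually used for the analysis is not reproduced.

```python
import random


def split_sample(n_total, n_explore, seed=0):
    """Randomly split respondent indices 0..n_total-1 into two disjoint sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    ids = list(range(n_total))
    rng.shuffle(ids)
    return sorted(ids[:n_explore]), sorted(ids[n_explore:])


# 1707 respondents divided into 840 (exploration) and 867 (confirmation).
explore_ids, confirm_ids = split_sample(1707, 840, seed=2004)
print(len(explore_ids), len(confirm_ids))
```

Model search is then confined to the first index set, and the models it suggests are tested only on the second.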


Table 10.7 Estimates for MIMIC models

                                        M4                            M3
                               Internal      External       Internal      External
Structural model
 Factor regressions
  $\gamma_{k1}$ [Educ]         0.34 (0.04)   0.24 (0.08)    0.38 (0.07)   0.28 (0.06)
  $\gamma_{k2}$ [Black]        -0.25 (0.08)  0              -0.37 (0.16)  0
 Factor (co)variances
  $\psi_{kk}$                  0.67 (0.05)   0.92 (0.23)    0.97 (0.17)   1.05 (0.15)
  $\psi_{12}$                  0.73 (0.09)                  0.69 (0.07)
Measurement model
 Factor loadings
  $\lambda_{1k}$ [Nosay]       1             0              1             0
  $\lambda_{2k}$ [Voting]      1.67 (0.45)   -0.85 (0.31)   0.33 (0.10)   0.04 (0.10)
  $\lambda_{3k}$ [Complex]     1.64 (0.41)   -0.60 (0.26)   0.20 (0.09)   0.32 (0.10)
  $\lambda_{4k}$ [Nocare]      0.82 (0.31)   1              0.63 (0.18)   1
  $\lambda_{5k}$ [Touch]       -0.01 (0.30)  1.34 (0.44)    -0.11 (0.18)  1.31 (0.30)
  $\lambda_{6k}$ [Interest]    0             1.79 (0.43)    0             1.59 (0.25)
 Item regression                        M4                            M3
  $\beta_1$ [Voting] x [Educ]           0                             0.20 (0.05)
  $\beta_2$ [Voting] x [Black]          0                             -0.34 (0.14)
  $\beta_3$ [Complex] x [Educ]          0                             0.28 (0.04)
Log-likelihood                          -4978.08                      -4973.36


The parameter estimates for the confirmation sample are reported in Table 10.7,
except for the thresholds which are given in Table 10.8. We note that
[Black] is negatively related to internal political efficacy, whereas [Educ] is
positively related to both kinds of political efficacy, as might be expected.
Since model M4 is clearly nested in M3, likelihood-ratio tests can be performed
on the confirmation sample. From Table 10.7, the likelihood ratio statistic is
9.43 with 3 degrees of freedom, indicating that the model with item-bias M3
should be retained, although the evidence is not overwhelming. It can be seen
from Table 10.7 that there is substantial item bias for two of the internal
efficacy items. [Black] has negative direct effects on the response for [Voting]
and [Educ] has positive direct effects on the responses for [Voting] and
[Complex].

Table 10.8 Estimated thresholds for models M4 and M3

                 $\kappa_{i1}$    $\kappa_{i2}$    $\kappa_{i3}$
M4
  [Nosay]        -1.51 (0.08)     -0.32 (0.06)     1.74 (0.08)
  [Voting]       -1.15 (0.07)     0.26 (0.05)      1.92 (0.10)
  [Complex]      -1.04 (0.07)     0.92 (0.06)      2.09 (0.11)
  [Nocare]       -1.85 (0.12)     0.30 (0.08)      3.28 (0.17)
  [Touch]        -1.52 (0.10)     0.85 (0.08)      3.19 (0.16)
  [Interest]     -1.77 (0.13)     0.66 (0.09)      3.88 (0.27)
M3
  [Nosay]        -1.75 (0.17)     -0.38 (0.08)     2.02 (0.19)
  [Voting]       -1.10 (0.06)     0.24 (0.05)      1.82 (0.09)
  [Complex]      -0.95 (0.06)     0.90 (0.06)      2.01 (0.09)
  [Nocare]       -1.92 (0.14)     0.31 (0.09)      3.39 (0.20)
  [Touch]        -1.57 (0.12)     0.89 (0.09)      3.32 (0.20)
  [Interest]     -1.80 (0.15)     0.66 (0.09)      3.90 (0.29)


10.3.4 Conclusion 

In summary, the psychometric validity of the political efficacy items appears to 
be dubious. Only a weak kind of bidimensionality is retained, the reliabilities 
appear to be low and substantial item-bias is found. We note that the two 
latter problems are compounded for the internal efficacy items. The problems 
unmasked here may be due to the conceptual gap between measures with
quasi-theoretic status on the one hand and theories subsequently developed on
the other hand (Mason et al., 1985). We conclude that the NES measurement
instrument for political efficacy investigated here might best be abandoned.
It is interesting to note in this connection that Converse (1972, p. 334), one of
the elder statesmen of this area, stated that:

“The political efficacy scale with which we have worked since 1952 involves a
considerable blend.”

Finally, we observe that a new instrument for political efficacy was
implemented in NES 1988 (e.g. Niemi et al., 1991).
10.4 Life satisfaction: Ordinal scaled probit factor models

10.4.1 Introduction

Satisfaction of life is a phenomenon which has attracted both normative and 
empirical interest. The most common approach in empirical research on life- 
satisfaction appears to be based on people’s reported perceptions of satisfaction 
(e.g. Campbell et al, 1976; Andrews and Withey, 1976). 

In this section, we present an empirical analysis of reported perceptions 
of life satisfaction among Americans, previously reported in Skrondal (1996). 
This methodology enables us to investigate the dimensionality of life-satisfaction 
and the quality of the individual items. Having obtained a retained model 
for life-satisfaction, the properties of the model are presented by means of a 
graphical procedure advocated by Lazarsfeld (1950). 

The data employed here are based on the 1989 version of the General Social 
Survey 3 (GSS). The GSS is a cross-sectional survey of the noninstitutionalized 
residential population of the continental USA aged 18 and over (NORC, 1989). 

It has been conducted almost annually by the National Opinion Research 
Center (NORC) at the University of Chicago since 1972. The purpose of the 
GSS is to monitor social trends in attitudes and behavior. 

The question wording of the life satisfaction items is: 

“For each area of life I am going to name, tell me the number that shows how 
much satisfaction you get from that area”. 

Five different areas are evaluated by the respondents: 

• [City] the city or place you live in 

• [Hobby] your nonworking activities - hobbies and so on 

• [Family] your family life 

• [Friend] your friendships 

• [Health] your health and physical condition 

The respondents’ numerical answers to the items correspond to a rating 
form labeled as 

1. A very great deal 

2. A great deal 

3. Quite a bit 

4. A fair amount 

5. Some 

6. A little

7. None 

3 The data used in this section were compiled by the National Opinion Research Center 
(NORC) and made available by the Norwegian Social Science Data Services (NSD). 
Neither NORC nor NSD are responsible for the analysis presented here. 
Such a rating form is often denoted a Likert form in attitude measurement
after Likert (1932).

The GSS has employed a split-ballot design (Smith, 1988) since 1988.
Specifically, some items are permanent in all surveys, whereas others, including
the life satisfaction items, are rotated among three ballots. We confine the
analysis to the 1035 respondents of ballots B and C who were presented the life
satisfaction items in 1989. The univariate frequency distributions of the items
are presented in Table 10.9, and the frequency distribution of the number of
items with missing values is reported in Table 10.10. The 1030 respondents
who responded to at least one life satisfaction item are analyzed here under
the missing at random (MAR) assumption. We note that there are remarkably few
missing values for the life satisfaction items of GSS 1989.

Table 10.9 Univariate frequency distributions of the life satisfaction items.

             7      6      5      4      3      2      1    Missing
[City]     178    283    217    196     63     69     22       2
[Hobby]    243    368    178    109     40     53     34       5
[Family]   433    344    102     67     22     44     15       3
[Friend]   343    397    156     78     14     32      8       2
[Health]   260    324    174    170     30     48     22       2

Table 10.10 Frequency distribution of the number of items with missing values

Number of missing items      0     1     2     3     4     5
Frequency                 1016    14     0     0     0     5

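Specification of a graded response model would lead to the following
unrestricted threshold model for each item i:

$$y_{ij} = \begin{cases}
1 & \text{if } y^*_{ij} \le \kappa_{i1} \\
2 & \text{if } \kappa_{i1} < y^*_{ij} \le \kappa_{i2} \\
3 & \text{if } \kappa_{i2} < y^*_{ij} \le \kappa_{i3} \\
4 & \text{if } \kappa_{i3} < y^*_{ij} \le \kappa_{i4} \\
5 & \text{if } \kappa_{i4} < y^*_{ij} \le \kappa_{i5} \\
6 & \text{if } \kappa_{i5} < y^*_{ij} \le \kappa_{i6} \\
7 & \text{if } \kappa_{i6} < y^*_{ij},
\end{cases}$$

all in all 30 threshold parameters.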

Due to the large number of response categories, this model has an excessive
number of parameters. Clogg (1979) and Masters (1985) therefore collapsed
response categories. Their particular collapsing was criticized by Thissen and
Steinberg (1988) who suggested a more sensible collapsing. However, a better
solution is feasible here since the items are homogeneous in the sense that the
same Likert rating form is used for all items. In this case it makes sense to
constrain the thresholds to be equal across items and introduce an intercept
for each item as in Section 10.2.2,

$$y^*_{ij} = \beta_i + \lambda_i\eta_j + \epsilon_{ij}, \qquad \kappa_{is} = \kappa_s, \quad \kappa_1 = 0.$$

Here the intercepts for all items are identified because $\kappa_1$ is set to 0. As
shown on page 147 in Section 5.2.3, setting the thresholds equal across items
does not only identify the intercepts but also the relative scales of the latent
responses, giving a scaled ordinal probit model (see also Section 2.3.4),

$$\epsilon_{ij} \sim \mathrm{N}(0, \theta_{ii}), \qquad \theta_{11} = 1.$$

Hence the locations and scales of the latent responses differ between the items.


10.4.2 Factor dimensionality 

We consider the analysis of all five life-satisfaction items from GSS. This is 
in line with Muraki (1990), but in contrast to Clogg (1979), Masters (1985) 
and Thissen and Steinberg (1988) who, for undisclosed reasons, confined the 
analysis to three of the items. Clogg (1988), on the other hand, considered 
four of the items. 

A unidimensional model M1 for the life-satisfaction items is first specified,
with identification restrictions

$$\kappa_1 = 0, \quad \lambda_1 = 1, \quad \theta_{11} = 1.$$

Estimated parameters and standard errors are reported in the second column
of Table 10.11. Inspection of the estimated parameters strongly suggests that
all diagonal elements of $\boldsymbol{\Theta}$ are close to unity, apart from $\theta_{44}$. Hence, we
next specify model M2 incorporating the restrictions

$$\theta_{22} = \theta_{33} = \theta_{55} = 1 \qquad (10.2)$$

in M1. The estimated parameters and standard errors of this model are given
in the third column of Table 10.11. The log-likelihoods for models M2 and
M1 are given in the same table. The likelihood ratio statistic is 0.96 with 3
degrees of freedom so the restrictions appear to be innocuous, although we
are guilty of ‘data snooping’ here.

Consider now M3, which is obtained from M1 by imposing the restriction

$$\boldsymbol{\Theta} = \mathbf{I},$$

or alternatively from the specification of $\theta_{44} = 1$ in M2. The likelihood ratio
statistic for comparing models M3 and M2 is 58.52 with 1 degree of freedom
from which it follows that M3 is clearly rejected. Thus, the residual variance
seems to be considerably lower for the [Friend] item than the other items,
which have the same residual variance.
Table 10.11 Estimated parameters of models M1 and M2

                               M1              M2
Fixed part
 Thresholds
  $\kappa_1$                   0               0
  $\kappa_2$                   1.08 (0.04)     1.10 (0.02)
  $\kappa_3$                   1.64 (0.05)     1.67 (0.03)
  $\kappa_4$                   2.25 (0.07)     2.29 (0.03)
  $\kappa_5$                   2.64 (0.08)     2.69 (0.04)
  $\kappa_6$                   3.13 (0.09)     3.19 (0.05)
 Intercepts
  $\beta_1$ [City]             1.13 (0.05)     1.15 (0.04)
  $\beta_2$ [Hobby]            0.82 (0.05)     0.84 (0.04)
  $\beta_3$ [Family]           0.22 (0.05)     0.22 (0.05)
  $\beta_4$ [Friend]           0.44 (0.04)     0.45 (0.04)
  $\beta_5$ [Health]           0.81 (0.05)     0.83 (0.04)
Random part
 Factor loadings
  $\lambda_1$ [City]           1               1
  $\lambda_2$ [Hobby]          1.44 (0.18)     1.43 (0.17)
  $\lambda_3$ [Family]         1.98 (0.25)     1.96 (0.23)
  $\lambda_4$ [Friend]         1.81 (0.22)     1.82 (0.19)
  $\lambda_5$ [Health]         1.44 (0.18)     1.43 (0.17)
 Factor variance
  $\psi_{11}$                  0.48 (0.04)     0.49 (0.04)
 Residual variances
  $\theta_{11}$ [City]         1               1
  $\theta_{22}$ [Hobby]        0.97 (0.09)     1
  $\theta_{33}$ [Family]       0.93 (0.09)     1
  $\theta_{44}$ [Friend]       0.45 (0.05)     0.47 (0.05)
  $\theta_{55}$ [Health]       0.93 (0.08)     1
Log-likelihood                 -7669.26        -7669.75
Figure 10.7 Empirical Bayes factor scores for life-satisfaction (M0)


Let us now consider an alternative bidimensional model for the life-satisfaction
items. Since we have no preconceptions regarding the patterning of the factor
loadings for these items, we specify an exploratory factor model reflecting
weak bidimensionality. In order to identify the model, we impose

$$\lambda_{21} = 1 = \lambda_{42},$$

and

$$\lambda_{11} = 0 = \lambda_{52}.$$

The restrictions in (10.2) are furthermore still imposed. The resulting model,
incorporating the identification restrictions $\kappa_1 = 0$ and $\theta_{11} = 1$, is denoted
M0. Note that the preferred unidimensional model M2 is nested in the
two-dimensional contender M0. The likelihoods for M2 and M0 are fairly similar
and we conclude that the restrictions leading to the unidimensional model
for life-satisfaction are acceptable and retain this model. This argument is
corroborated by the empirical Bayes plot of the factor scores for the two
dimensions of life-satisfaction presented in Figure 10.7. Note that the scores
from M0 are nearly linearly related, which is also reflected in the estimated
correlation $\psi_{21}/\sqrt{\psi_{11}\psi_{22}} = 0.93$. This result is even more compelling when
we remember that an exploratory factor model is used.

We see that M2 outperforms all contenders considered above and is thus
chosen as our retained model. Following Lazarsfeld (1950), we now present
the item characteristic curves of M2 in Figures 10.8 and 10.9. An item
characteristic curve represents a plot of the conditional response distribution
(see also Section 3.3.4), here given by

$$\Pr(y_{ij} = s \mid \eta_j) = \Phi\!\left(\frac{\kappa_s - \beta_i - \lambda_i\eta_j}{\sqrt{\theta_{ii}}}\right) - \Phi\!\left(\frac{\kappa_{s-1} - \beta_i - \lambda_i\eta_j}{\sqrt{\theta_{ii}}}\right),$$

where $\kappa_0 = -\infty$ and $\kappa_7 = \infty$. The curves represent the probability of
responding in a particular category s for a given item i as a function of degree
of life satisfaction (we have reversed the life-satisfaction scale in the figures
so that more satisfaction is associated with lower response categories). It is
evident from Figure 10.8 that the [Friend] item functions better than the [City]
item in the sense that the response categories discriminate well between
different degrees of life satisfaction. We see from Figure 10.9 that the
remaining items occupy an intermediate position in this regard.
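
To make the item characteristic curves concrete, the sketch below evaluates the category probabilities of a scaled ordinal probit item at a given factor value, assuming the parameterization $y^*_{ij} = \beta_i + \lambda_i\eta_j + \epsilon_{ij}$ with common thresholds and $\epsilon_{ij} \sim \mathrm{N}(0, \theta_{ii})$. The function names are ours; the numbers are the rounded M2 estimates for [City] and [Friend] from Table 10.11, and the scale reversal used in the figures is not applied.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def category_probs(eta, beta, lam, theta, kappas):
    """Pr(y = s | eta) for s = 1..S under y* = beta + lam * eta + eps."""
    cuts = [-math.inf] + list(kappas) + [math.inf]
    sd = math.sqrt(theta)
    cdf = [norm_cdf((k - beta - lam * eta) / sd) for k in cuts]
    return [cdf[s] - cdf[s - 1] for s in range(1, len(cuts))]


KAPPAS = [0.0, 1.10, 1.67, 2.29, 2.69, 3.19]   # common thresholds (M2)
CITY = dict(beta=1.15, lam=1.0, theta=1.0)
FRIEND = dict(beta=0.45, lam=1.82, theta=0.47)

for eta in (-1.0, 0.0, 1.0):
    city = category_probs(eta, kappas=KAPPAS, **CITY)
    friend = category_probs(eta, kappas=KAPPAS, **FRIEND)
    print(eta, [round(p, 3) for p in city], [round(p, 3) for p in friend])

# Slope of the latent response in eta, in residual standard deviation units.
slope_city = CITY["lam"] / math.sqrt(CITY["theta"])
slope_friend = FRIEND["lam"] / math.sqrt(FRIEND["theta"])
```

The larger standardized slope for [Friend] is what makes its curves change more sharply with the factor, which is one way to read the statement that its categories discriminate better.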

10.4.3 Reliabilities 

Consider next the reliabilities of the different life satisfaction items under the
unidimensional model. Analogously to the definition introduced on page 332,
we define the lower bounds of the reliabilities $\rho_i$ in the present model as:

$$\rho_i = \frac{\lambda_i^2\psi_{11}}{\lambda_i^2\psi_{11} + \theta_{ii}}.$$

The estimated lower bounds of the estimated reliabilities for models M1 and
M2 are reported in the second and third columns of Table 10.12 respectively.


Table 10.12 Estimated lower bounds of reliabilities for models M1, M2 and M4

             M1       M2       M4
[City]       0.19     0.19     0.21
[Hobby]      0.33     0.33     0.35
[Family]     0.49     0.48     0.51=
[Friend]     0.62     0.63     0.51=
[Health]     0.34     0.33     0.35


We observe that the lower bound of the reliability of the [City] item is very 
low, the [Hobby] and [Health] items occupy an intermediate position, whereas 
the [Family] and [Friend] items stand out as most promising. The utility of the 
[City] item is questionable, since measurement error and/or an item specific 
component are of considerable magnitude. 
we obtain

$$\frac{(\lambda_3)^2\psi_{11}}{(\lambda_3)^2\psi_{11} + 1} = \frac{(\lambda_4)^2\psi_{11}}{(\lambda_4)^2\psi_{11} + \theta_{44}}.$$

Solving the above equation for $\theta_{44}$ yields

$$\theta_{44} = \left(\frac{\lambda_4}{\lambda_3}\right)^2,$$

which is clearly a nonlinear parameter restriction. Observe that $\theta_{44}$ is
definitely not represented by a fundamental parameter in this case, since it is
expressed as a function of the structural parameters $\lambda_3$ and $\lambda_4$. The
resultant model, including the nonlinear parameter restriction, is denoted M4.
The translation table between structural and fundamental parameters for this
model is presented in Table 10.13. Note in particular the expression for $\theta_{44}$.


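Table 10.13 Translation table - Fundamental and structural parameters (M4).

Structural parameters                                                          Fundamental parameters
$\kappa_{12}, \kappa_{22}, \kappa_{32}, \kappa_{42}, \kappa_{52}$              $\vartheta_1$
$\kappa_{13}, \kappa_{23}, \kappa_{33}, \kappa_{43}, \kappa_{53}$              $\vartheta_2$
$\kappa_{14}, \kappa_{24}, \kappa_{34}, \kappa_{44}, \kappa_{54}$              $\vartheta_3$
$\kappa_{15}, \kappa_{25}, \kappa_{35}, \kappa_{45}, \kappa_{55}$              $\vartheta_4$
$\kappa_{16}, \kappa_{26}, \kappa_{36}, \kappa_{46}, \kappa_{56}$              $\vartheta_5$
$\beta_1$                                                                      $\vartheta_6$
$\beta_2$                                                                      $\vartheta_7$
$\beta_3$                                                                      $\vartheta_8$
$\beta_4$                                                                      $\vartheta_9$
$\beta_5$                                                                      $\vartheta_{10}$
$\lambda_{21}$                                                                 $\vartheta_{11}$
$\lambda_{31}$                                                                 $\vartheta_{12}$
$\lambda_{41}$                                                                 $\vartheta_{13}$
$\lambda_{51}$                                                                 $\vartheta_{14}$
$\psi_{11}$                                                                    $\vartheta_{15}$
$\theta_{44}$                                                                  $(\vartheta_{13}/\vartheta_{12})^2$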


Regarding the estimated parameters, we report $\hat\lambda_3 = 1.95$ and $\hat\lambda_4 = 1.67$,
from which it follows that $\hat\theta_{44} = 0.73$. The estimated standard error of
$\hat\theta_{44}$ can, if desired, be obtained via the delta method as described in
Section 8.3.1. In fact, this was the motivation for insisting that the function
$h_k$ presented in Section 4.5 was once differentiable. The estimated lower
bounds of the reliabilities are reported in the fourth column of Table 10.12,
where the superscript ‘=’ indicates that equality restrictions are imposed.

Comparing the restricted model M4 with M2, the likelihood ratio statistic
is 18.89 with one degree of freedom so that there is considerable evidence
against the assertion of equal lower bounds for the reliabilities of the [Family]
and [Friend] items.
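
The nonlinear restriction can be verified numerically. The sketch below uses our own function names and assumes, as in the restrictions (10.2), that the residual variance of [Family] is fixed at 1. It computes $\theta_{44}$ from the reported loadings and confirms that the two reliability lower bounds then coincide whatever the value of the factor variance.

```python
def reliability_lower_bound(lam, psi11, theta):
    """Lower bound lam^2 psi11 / (lam^2 psi11 + theta) for one item."""
    signal = lam ** 2 * psi11
    return signal / (signal + theta)


def theta44_equal_reliability(lam3, lam4, theta33=1.0):
    """Residual variance of item 4 that equates the bounds of items 3 and 4."""
    return theta33 * (lam4 / lam3) ** 2


lam3, lam4 = 1.95, 1.67                    # estimates reported for M4
theta44 = theta44_equal_reliability(lam3, lam4)
print(round(theta44, 2))

# The equality holds for any factor variance; the values below are arbitrary.
for psi11 in (0.1, 0.25, 1.0):
    rho3 = reliability_lower_bound(lam3, psi11, 1.0)
    rho4 = reliability_lower_bound(lam4, psi11, theta44)
    print(psi11, round(rho3, 4), round(rho4, 4))
```

Since $\psi_{11}$ cancels from the restriction, $\theta_{44}$ depends only on the ratio of the two loadings.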


© 2004 by Chapman & Hall/CRC 









10.4.4 Special cases

Muraki (1990) presented an IRT model which is a special case of the model
discussed here. We note in passing that many special cases of the categorical
factor model are equivalent to models formulated in the IRT literature (e.g.
Takane and de Leeuw, 1987). Estimation of this model using the EM algorithm
is implemented in the accompanying PARSCALE software written by Muraki
and Bock (1993).

Regarding the Muraki-Bock model, we first observe that only the unidimensional
model is accommodated. Moreover, in addition to our identification
restrictions they impose $\boldsymbol{\Theta} = \mathbf{I}$ and, if the factor loadings are not all equal,
ks -2 = where a is a fixed value. Muraki and Bock consider both restric¬ 
tions necessary for identification. It is evident, however, from our analysis in 
Section 5.2.3 that these restrictions are not required. The failure to recognize
this is apparently due to an exclusive focus on the threshold structure of
individual items (see Muraki, 1990, p. 64), so that the information contained
in the correlation structure and the simultaneous threshold structure of the
items is not used. It follows that some of their identification restrictions are
empirically falsifiable. For instance, the restriction $\boldsymbol{\Theta} = \mathbf{I}$ was clearly
rejected when we compared models M2 and M3. Similarly, we retained the
unidimensional model M2 in favor of the bidimensional M0. A final limitation of
the Muraki-Bock methodology is that covariates cannot be included. An important
virtue of the analysis of identification outlined in Section 5.2 is that it
can reveal that models conventionally regarded as not identified turn out to
be so.

10.4.5 Conclusion

We conclude that life-satisfaction, as measured in the GSS, appears to be a 
unidimensional phenomenon. Regarding the quality of the individual items, 
the [Friend] item appears to be the best item, both in terms of discrimination 
and reliability. The [City] item, on the other hand, stands out in a negative 
direction, since measurement error and/or an item specific component are of a 
considerable magnitude. Hence, one should seriously consider discarding this 
item. 


10.5 Summary and further reading 

The sex education example involved a complex research design with a clus¬ 
ter randomized intervention coupled with multiple ordinal measurements of 
a latent variable repeated over time. The first model considered was a ran¬ 
dom intercept proportional odds model. Other models for clustered ordered 
categorical data are reviewed in Agresti and Natarajan (2001). Instead of 
specifying a random intercept, Wolfe and Firth (2002) allow the thresholds 
to vary randomly. King et al. (2003) describe a method of making ordered 
responses comparable across different cultures by anchoring the thresholds
using vignettes.

A latent growth curve model was then used to model a latent outcome,
‘contraceptive efficacy’. Such models can also be used when the latent outcome
has been measured by continuous responses (e.g. Rabe-Hesketh et al., 2001d)
or responses of mixed types (e.g. Gueorguieva and Sanacora, 2003). Skrondal
et al. (2002) extended the latent growth curve model for the sex education
data to accommodate nonignorable dropout.

The political efficacy and life-satisfaction examples served to illustrate the
use of latent variable models for psychometric validation of measurement
instruments with ordinal items. We investigated in what sense the instruments
could be called uni- or bidimensional. Item-bias was investigated using
generalized MIMIC models. The items in the life-satisfaction example have seven
categories. Since the number of threshold parameters proliferates as the number
of ordinal categories increases, we set thresholds equal across items but
allowed the intercepts and residual variances of the latent responses to
differ, yielding a scaled ordinal probit model. Ordinal IRT or factor models are
discussed by Johnson and Albert (1999, Chapters 6 and 7) and Moustaki
(2000). Bivariate multilevel ordinal response models are developed in Grilli
and Rampichini (2003).

In this chapter we have used cumulative models for ordinal responses (the 
proportional odds and ordinal probit model). Other possibilities include the 
adjacent category logit or continuation ratio logit. The latter is used to model 
discrete time duration data in Section 12.3. Furthermore, we did not consider 
latent class models for ordinal responses; see Vermunt (2001). 
CHAPTER 11


Counts 


11.1 Introduction 

In this chapter we discuss both Poisson and binomial models for counts. Since 
counts are aggregated data, obtained by summing dichotomous variables rep¬ 
resenting the occurrence of an event or presence of a feature, even apparently 
simple data structures with one count per unit can be considered as two- 
level datasets. An important consequence is that there might be unobserved 
heterogeneity leading to overdispersion. 

In the first example we consider different approaches to handling overdis¬ 
persion, including the zero-inflated Poisson and zero-inflated binomial models. 
In a second example we estimate random coefficient models for longitudinal 
count data and use various model diagnostics discussed in Section 8.6. Fi¬ 
nally, we discuss disease mapping and small area estimation using models 
with a spatial dependence structure for the random effects. 

11.2 Prevention of faulty teeth in children: Modeling 
overdispersion 

11.2.1 Introduction 

We consider dental data on 797 Brazilian children who participated in a dental
health trial (Mendonça and Böhning, 1994)¹. Each of six schools was assigned
to one of six treatments aiming to prevent tooth decay:

• [Control] no treatment 

• [Educ] oral health education 

• [Enrich] school diet enriched with ricebran 

• [Rinse] mouthrinse with 0.2% NaF solution 

• [Hygiene] oral hygiene 

• [All] all four treatments above 

The outcome was the number of decayed, missing or filled teeth (DMFT). In 
addition to school or treatment arm, there were two other covariates: 

• [Male] dummy variable for child a male 

• Ethnic group (reference group ‘brown’):
  — [White] dummy variable for white
  — [Black] dummy variable for black

¹ These data can be downloaded from gllamm.org/books or the Royal Statistical Society
Datasets Website at http://www.blackwellpublishing.com/rss/Volumes/av162p2.htm.

The observed distribution of DMFT counts (marginal over the covariates) is
shown in the first two columns of Table 11.1. An obvious model to consider is
the Poisson model presented in Section 2.2 with mean $\mu_i$ for child $i$ modeled
as

$$\log(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta},$$

where the covariates are dummies for the treatment arms, sex and ethnic
groups. The third column of Table 11.1 shows the predicted frequencies for
this model, again marginal over the covariates. The predicted frequencies were
obtained by computing the probability for each possible count from 0 to 20 for
each of the observed covariate patterns, multiplying the probability by the
total number of observations and then aggregating over the covariates. Although
all available covariates have been included in the model, there are still large
discrepancies between observed and expected counts. The largest discrepancy
is for zero DMFT (231 observed compared with 134 expected).
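
The computation of marginal expected frequencies can be sketched as follows. This is our reading of the procedure, implemented by summing the predicted Poisson probabilities over children, which weights each covariate pattern by its frequency. The function names and the linear predictors are hypothetical illustrations and are not the trial data.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def expected_frequencies(linear_predictors, max_count=20):
    """Expected number of children at each count 0..max_count."""
    freqs = [0.0] * (max_count + 1)
    for xb in linear_predictors:
        mu = math.exp(xb)                      # log link: mu_i = exp(x_i' beta)
        for y in range(max_count + 1):
            freqs[y] += poisson_pmf(y, mu)
    return freqs


# Hypothetical linear predictors, one per child (illustrative values only).
xb = [0.8, 0.8, 0.8, 0.3, 0.3]
freqs = expected_frequencies(xb)
print([round(f, 3) for f in freqs[:5]])
```

Comparing such expected frequencies with the observed ones count by count is what reveals the shortfall of predicted zeros described above.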


11.2.2 Modeling overdispersion 

Because of the large number of observed zeros, Böhning et al. (1999) estimated
a zero-inflated Poisson (ZIP) model (Lambert, 1992) for these data. A
ZIP model is a mixture of two Poisson distributions, one having zero mean
and the other having a mean that depends on covariates, giving the response
probability

$$\Pr(y_i \mid \mathbf{x}_i) = \pi_1\, g(y_i;\, \mu_i = 0) + \pi_2\, g(y_i;\, \mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta})), \qquad (11.1)$$

where $\pi_1$ and $\pi_2 = 1 - \pi_1$ are the component weights or latent class
probabilities and $g(y_i; \mu_i)$ is the Poisson probability for count $y_i$ with mean
$\mu_i$,

$$g(y_i; \mu_i) = \frac{\mu_i^{y_i}\exp(-\mu_i)}{y_i!}.$$

When $y_i > 0$, the first term in (11.1) is zero, so units with counts greater
than zero belong to the second class. However, when $y_i = 0$, both terms are
greater than zero because a zero count can result from a Poisson model with
mean zero (class 1) or mean greater than zero (class 2). Therefore units with
counts equal to zero could belong to either class. For instance, if the count is
the number of drinks consumed in the last two weeks, a zero response could
be from a teetotaller (who never drinks) or from a drinker who happened
not to drink during the period. In ZIP models the latent class probability $\pi_1$
determines the number of excess zeros compared with an ordinary Poisson
model.
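
The mixture in (11.1) is straightforward to evaluate directly. The sketch below is a minimal illustration with our own function names; the class probability and linear predictor are arbitrary values chosen for the example, not estimates from the dental data.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability; with mu = 0 all mass sits on y = 0."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def zip_pmf(y, xb, pi1):
    """Zero-inflated Poisson probability as in (11.1)."""
    mu = math.exp(xb)
    return pi1 * poisson_pmf(y, 0.0) + (1.0 - pi1) * poisson_pmf(y, mu)


pi1, xb = 0.2, 0.9                              # illustrative values only
mu = math.exp(xb)
for y in range(4):
    print(y, round(zip_pmf(y, xb, pi1), 4), round(poisson_pmf(y, mu), 4))
```

For a zero count both classes contribute, so the ZIP probability exceeds the Poisson probability with the same mean parameter; for positive counts only the second class contributes.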


It may however not be necessary to allow explicitly for an excess number of
zeros in this way since other forms of overdispersion are also consistent with
larger numbers of zeros. We therefore also consider random intercept models

$$\ln(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta} + \zeta_i,$$

where $\zeta_i$ is either $\mathrm{N}(0, \psi)$ or discrete,

$$\zeta_i = e_c, \qquad c = 1, \ldots, C,$$

with probabilities $\pi_c$. In this example, C = 2 gave the largest likelihood and
therefore corresponds to the nonparametric maximum likelihood estimator
(NPMLE) (see Section 4.4.2). Note that the intercept $\zeta_i$ varies at level 1 here,
unlike other applications in the book where it varies at higher levels.

Table 11.1 Observed and expected frequencies (marginal w.r.t. covariates and
random effects) for Poisson models with different kinds of overdispersion

The log-likelihood, AIC and BIC for each model are given in
Table 11.2, where we have used the number of children (797) for N and the
number of estimated parameters for v. (As discussed in Section 8.4.2,
determining N and v for AIC and BIC is not obvious in latent variable models.)
The smallest AIC and BIC are shown in bold in the table.

According to the AIC, the two-class random intercept model provides the
best fit, but according to the BIC, the ZIP model provides a better fit. The
discrepancy between observed and expected frequencies is particularly large
for five DMFTs, even for the best-fitting models.

Table 11.2 Estimates for various Poisson models

                                        Normal         Two-class
                         Poisson        Intercept      Intercept      ZIP
Parameter                Est (SE)       Est (SE)       Est (SE)       Est (SE)

Regression coefficients
$\beta_0$ [Cons]          0.76 (0.07)    0.63 (0.09)    -              0.94 (0.08)
Treatment:
$\beta_1$ [Educ]         -0.23 (0.09)   -0.23 (0.11)   -0.24 (0.11)   -0.22 (0.09)
$\beta_2$ [Enrich]       -0.09 (0.09)   -0.09 (0.11)   -0.08 (0.09)   -0.06 (0.09)
$\beta_3$ [Rinse]        -0.35 (0.08)   -0.37 (0.11)   -0.26 (0.10)   -0.22 (0.09)
$\beta_4$ [Hygiene]      -0.30 (0.09)   -0.32 (0.11)   -0.22 (0.11)   -0.23 (0.10)
$\beta_5$ [All]          -0.59 (0.10)   -0.61 (0.12)   -0.49 (0.11)   -0.47 (0.11)
$\beta_6$ [Male]          0.13 (0.05)    0.13 (0.07)    0.10 (0.06)    0.10 (0.06)
Ethnicity:
$\beta_7$ [White]         0.09 (0.06)    0.10 (0.07)    0.09 (0.07)    0.08 (0.06)
$\beta_8$ [Black]        -0.14 (0.09)   -0.16 (0.11)   -0.12 (0.10)   -0.12 (0.10)

Variance
$\psi$                    -              0.29 (0.05)    -              -

Log odds parameter
$\varrho_1$               -              -             -0.86 (0.21)   -1.39 (0.12)

Location parameters
$e_1$                     -              -              [illegible]    -
$e_2$                     -              -              1.04 (0.09)    -

Log-likelihood           -1469.05       -1432.53       -1406.03       -1410.27
AIC                       2956.04        2885.07        2834.06        2840.54
BIC                       3058.35        2998.68        2959.04        2954.16

The parameter estimates in Table 11.2 represent estimated adjusted log
ratios of the expected numbers of DMFTs. For instance, for the ZIP model, the
sex and ethnicity adjusted ratio of the expected count in the group receiving
all treatments [All] divided by the expected count in the control group is
estimated as exp(-0.47) = 0.63. The other treatments also reduce the expected
number of DMFTs, but [All] is the most effective and [Enrich] has a
negligible effect. Comparing the ordinary Poisson model with the normal random
intercept Poisson model, we see that the effects of the covariates are almost
identical. This is because for a random intercept model with a log-link, the
conditional effects equal the marginal effects; see Section 4.8.1.
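The statement about the log link can be checked numerically. The following is a minimal sketch, assuming a normal random intercept and using the rounded estimates 0.63, -0.61 and 0.29 from the normal random intercept Poisson model; the function and variable names are illustrative and not part of the original analysis.

```python
import math

def marginal_mean(eta, psi):
    """E[exp(eta + zeta)] for zeta ~ N(0, psi), i.e. a lognormal mean."""
    return math.exp(eta + psi / 2.0)

# Rounded estimates for the normal random intercept Poisson model
# (constant, [All] coefficient and random intercept variance).
beta0, beta_all, psi = 0.63, -0.61, 0.29

control = marginal_mean(beta0, psi)
treated = marginal_mean(beta0 + beta_all, psi)

# The factor exp(psi / 2) cancels in the ratio, so the marginal rate ratio
# equals the conditional rate ratio exp(beta_all); only the intercept shifts.
marginal_ratio = treated / control
conditional_ratio = math.exp(beta_all)

# Rate ratio quoted in the text for the ZIP model.
zip_ratio = math.exp(-0.47)
print(marginal_ratio, conditional_ratio, zip_ratio)
```

Under these assumptions the marginal log mean for the control group is 0.63 + 0.29/2 = 0.775, close to the constant of 0.76 in the ordinary Poisson model, while the covariate effects are unchanged.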

Use of a Poisson distribution is questionable here where the count represents
the number of decayed, missing or filled teeth ('successes') out of a total of
eight deciduous molars ('trials'). We will therefore instead consider models
assuming a binomial distribution with denominator 8 for the counts. Table 11.3
shows that a simple binomial logistic regression model does not produce a
sufficiently large expected frequency of zero counts.

To handle this problem we introduce a zero-inflated binomial (ZIB) model 
analogous to the ZIP model discussed above. The ZIB model is a mixture of 
two binomial distributions, one having probability parameter equal to zero 
and the other having probability parameter depending on covariates via a 
logit link, giving the response probability 

$$\Pr(y_i \mid \mathbf{x}_i) = \pi_1\, g(y_i;\, \mu_i = 0) + \pi_2\, g(y_i;\, \mathrm{logit}(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}), \qquad (11.2)$$

where $g(y_i; \mu_i)$ is now a binomial probability with parameter $\mu_i$ and
denominator 8.

Similarly, we can estimate binomial logistic regression models with normal or 
nonparametric random intercepts. The estimates for these models are given 
in Table 11.4 and the expected frequencies in Table 11.3. 
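The mixture in (11.2) can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the mixing probability is obtained from the ZIB log odds of -1.16, and the binomial probability uses the ZIB constant of -0.69 with all covariates set to zero.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def binomial_pmf(y, n, p):
    """Binomial probability of y successes out of n trials."""
    return math.comb(n, y) * p**y * (1.0 - p) ** (n - y)

def zib_pmf(y, pi_zero, p, n=8):
    """Zero-inflated binomial: with probability pi_zero the probability
    parameter is zero (so y = 0), otherwise y ~ Binomial(n, p)."""
    structural_zero = pi_zero if y == 0 else 0.0
    return structural_zero + (1.0 - pi_zero) * binomial_pmf(y, n, p)

pi_zero = expit(-1.16)   # probability of the zero-probability class
p = expit(-0.69)         # binomial probability with all covariates zero
probs = [zib_pmf(y, pi_zero, p) for y in range(9)]
print([round(q, 3) for q in probs])
```

The zero count receives mass from both components, which is how the ZIB model raises the expected frequency of zeros relative to the simple binomial model.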


Table 11.3 Observed and predicted frequencies (marginal w.r.t. covariates and
random effects) for binomial logistic regression models with different kinds of
overdispersion

                                 Predicted Frequencies
DMFT     Observed                  Normal       Three-class
Count    Frequency    Binomial     Intercept    Intercept      ZIB

0        231          107.21       207.22       226.73         227.88
1        163          230.82       202.14       170.45         120.50
2        140          233.84       149.66       143.67         174.21
3        116          145.04       101.13       108.65         150.21
4         70           59.90        64.65        78.02          84.21
5         55           16.76        38.81        45.52          31.33
6         22            3.08        21.09        18.70           7.53
7          0            0.34         9.51         4.71           1.06
8          0            0.02         2.78         0.55           0.07

Table 11.4 Estimates for various binomial logistic regression models

                                        Normal         Three-class
                         Binomial       Intercept      Intercept      ZIB
Parameter                Est (SE)       Est (SE)       Est (SE)       Est (SE)

Regression coefficients
$\beta_0$ [Cons]         -1.00 (0.08)   -1.29 (0.14)    -             -0.69 (0.09)
Treatment:
$\beta_1$ [Educ]         -0.32 (0.10)   -0.35 (0.17)   -0.41 (0.17)   -0.32 (0.11)
$\beta_2$ [Enrich]       -0.12 (0.10)   -0.15 (0.17)   -0.16 (0.16)   -0.09 (0.11)
$\beta_3$ [Rinse]        -0.47 (0.10)   -0.57 (0.16)   -0.49 (0.16)   -0.26 (0.11)
$\beta_4$ [Hygiene]      -0.40 (0.10)   -0.50 (0.18)   -0.36 (0.17)   -0.29 (0.11)
$\beta_5$ [All]          -0.76 (0.11)   -0.91 (0.17)   -0.80 (0.17)   -0.59 (0.13)
$\beta_6$ [Male]          0.17 (0.06)    0.20 (0.10)    0.16 (0.09)    0.13 (0.07)
Ethnicity:
$\beta_7$ [White]         0.13 (0.07)    0.14 (0.11)    0.13 (0.10)    0.11 (0.08)
$\beta_8$ [Black]        -0.18 (0.10)   -0.23 (0.16)   -0.17 (0.15)   -0.14 (0.12)

Variance
$\psi$                    -              1.05 (0.12)    -              -

Log odds parameters
$\varrho_1$               -              -             -1.01 (0.40)   -1.16 (0.10)
$\varrho_2$               -              -              0.27 (0.30)    -

Location parameters
$e_1$                     -              -            -32.09*         $-\infty$
$e_2$                     -              -             -1.53 (0.26)    -
$e_3$                     -              -             -0.06 (0.18)    -

Log-likelihood           -1546.78       -1409.39       -1397.53       -1431.09
AIC                       3111.56        2838.78        2821.07        2882.19
BIC                       3213.81        2952.40        2968.77        2995.80

* Boundary solution


It is interesting to note that the binomial logistic normal random intercept
model fits considerably better than the Poisson normal random intercept
model (log-likelihood -1409.39 versus -1432.53 with the same number of
parameters), mainly because the latter produces quite large expected
frequencies for large counts whereas the former cannot generate any counts
exceeding 8. The binomial logistic NPML solution has three classes (or masses),
and this model fits best among all models considered according to the AIC.
According to the BIC, however, the normal random intercept binomial logistic
model would be chosen.
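The AIC comparison can be reproduced from the log-likelihoods. In the sketch below the parameter counts v are assumptions obtained by counting the estimated entries for each model in Table 11.4 (for example 8 coefficients, 2 log odds and 3 locations for the three-class model); they are not stated in the text.

```python
def aic(loglik, n_params):
    """Akaike information criterion: -2 log-likelihood + 2 v."""
    return -2.0 * loglik + 2.0 * n_params

# (log-likelihood, assumed number of estimated parameters v)
models = {
    "Binomial": (-1546.78, 9),
    "Normal intercept": (-1409.39, 10),
    "Three-class intercept": (-1397.53, 13),
    "ZIB": (-1431.09, 10),
}

aic_values = {name: aic(ll, v) for name, (ll, v) in models.items()}
best_by_aic = min(aic_values, key=aic_values.get)

for name, value in aic_values.items():
    print(f"{name:22s} {value:8.2f}")
print("Smallest AIC:", best_by_aic)
```

The computed values agree with the tabulated AICs up to rounding of the log-likelihoods, which supports the assumed parameter counts.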

The parameter estimates for the binomial logistic models represent estimated
log odds ratios. For instance, for the three-class model, the sex and
ethnicity adjusted ratio of the odds of having a DMFT in the [All] group
divided by the odds of having a DMFT in the control group is estimated as
exp(-0.80) = 0.45. Again, all treatments appear to be beneficial, although the
effect of [Enrich] appears to be negligible.

Comparing the estimates for random intercept binomial models, which
represent conditional effects given the random intercept, with the estimates for
the ordinary binomial model, which represent marginal effects, the
attenuation discussed in Section 4.8.1 is evident. Note that the estimates for
the binomial NPML model represent a boundary solution since the first class has
an estimated location of -32.09, corresponding to a binomial probability
parameter of virtually zero. The log-odds of belonging to this class is estimated
as -1.01, similar to the log-odds of -1.16 of belonging to the zero-probability
class in the ZIB model (the corresponding probabilities are 0.27 and 0.24).
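The quoted odds ratio and class probabilities follow from simple transformations of the estimates. This is a minimal sketch; treating each log odds as a two-way logit is an assumption that happens to reproduce the probabilities quoted in the text, and the variable names are illustrative.

```python
import math

def expit(x):
    """Inverse logit: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Odds ratio for [All] versus control in the three-class model.
odds_ratio_all = math.exp(-0.80)

# Probabilities implied by the log odds of the (near) zero-probability classes.
p_first_class_npml = expit(-1.01)
p_zero_class_zib = expit(-1.16)

# Binomial probability implied by the boundary location of -32.09.
p_boundary = expit(-32.09)

print(round(odds_ratio_all, 2), round(p_first_class_npml, 2),
      round(p_zero_class_zib, 2), p_boundary)
```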

11.3 Treatment of epilepsy: A random coefficient model 

11.3.1 Introduction 

The longitudinal epilepsy data from Leppik et al. (1987) have previously
been analyzed by Thall and Vail (1990), Breslow and Clayton (1993), Lindsey
(1999), Diggle et al. (2002) and many others. The data² come from a
randomized controlled trial comparing an anti-epileptic drug with placebo. For each
patient the number of epileptic seizures was recorded during a baseline period
of eight weeks. Patients were then randomized to treatment with progabide or
to placebo (in addition to standard chemotherapy). The outcomes are counts
of epileptic seizures during the two weeks before each of four consecutive clinic
visits. Breslow and Clayton considered the following covariates:

• [Lbas] logarithm of a quarter of the number of seizures in the eight weeks
preceding entry into the trial

• [Treat] dummy variable for treatment group

• [LbasTrt] interaction between the two variables above

• [Lage] logarithm of age

• [V4] dummy for visit 4

• [Visit] time at visit, coded as -0.3, -0.1, 0.1 and 0.3
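The covariates in the list above can be constructed as follows. This is a sketch only: the function name and the example patient are hypothetical, and no mean-centering of the predictors is applied here.

```python
import math

def epilepsy_covariates(baseline_count, treated, age, visit):
    """Covariates for one subject at one visit (visit is 1, 2, 3 or 4).

    baseline_count is the number of seizures in the eight-week baseline
    period and must be positive for the logarithm to exist.
    """
    lbas = math.log(baseline_count / 4.0)
    treat = 1 if treated else 0
    return {
        "Lbas": lbas,
        "Treat": treat,
        "LbasTrt": lbas * treat,
        "Lage": math.log(age),
        "V4": 1 if visit == 4 else 0,
        "Visit": (2 * visit - 5) / 10.0,  # gives -0.3, -0.1, 0.1, 0.3
    }

# Hypothetical patient: 52 baseline seizures, treated, aged 30, fourth visit.
row = epilepsy_covariates(baseline_count=52, treated=True, age=30, visit=4)
visit_codes = [epilepsy_covariates(52, True, 30, v)["Visit"] for v in (1, 2, 3, 4)]
print(row, visit_codes)
```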

11.3.2 Modelling repeated counts 

Model II in Breslow and Clayton is a log-linear (Poisson regression) model
including all the covariates listed above except [Visit] as well as a random
intercept for subjects. The seizure count $y_{ij}$ for subject $j$ at visit $i$ is assumed
to be conditionally Poisson distributed with mean $\mu_{ij}$ modeled as

$$\nu_{ij} = \log(\mu_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_{1j}.$$

² The data can be downloaded from gllamm.org/books

The subject-specific random intercept $\zeta_{1j}$ is assumed to have a normal
distribution with zero mean and variance $\psi_{11}$.

Model IV in Breslow and Clayton includes the predictor [Visit] instead of
[V4] and has a random slope of [Visit] in addition to the random intercept,

$$\nu_{ij} = \log(\mu_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_{1j} + \zeta_{2j} z_{ij}.$$

The intercept and slope are assumed to have a bivariate normal distribution
with variances $\psi_{11}$ and $\psi_{22}$, respectively, and covariance $\psi_{21}$.

Rabe-Hesketh et al. (2002) showed that the parameters of these models can
be reliably estimated using adaptive quadrature whereas ordinary quadrature
is extremely unstable (see Section 6.3.2 for a description of quadrature
methods). We therefore use adaptive quadrature with 15 points for Model
II and 8 points per dimension for Model IV. Maximum likelihood estimates
and standard errors are shown in Table 11.5 together with PQL-1 estimates
(see Section 6.3.1) reported by Breslow and Clayton. The estimates produced
by adaptive quadrature and PQL-1 were very similar (the constants are not
comparable since we have centered the predictors around their means), which
is reassuring since PQL-1 is expected to work well in this particular case.
Note that the parameter estimates for Model II reported in Yau and Kuk
(2002) using 16-point ordinary quadrature are considerably different from the
PQL-1 and adaptive quadrature estimates (for example the treatment effect
estimate is -0.52), suggesting that their solution is not reliable. We also
report 'robust' standard errors SE_r based on the sandwich estimator (see
Section 8.3.3) for the maximum likelihood estimates.

Table 11.5 Parameter estimates and standard errors for Models II and IV using
PQL-1 (Breslow and Clayton, 1993) and maximum likelihood using adaptive
Gaussian quadrature

                            Model II                             Model IV
                     PQL-1          AGQ                   PQL-1          AGQ
Parameter            Est (SE)       Est (SE, SE_r)†       Est (SE)       Est (SE, SE_r)†

Fixed effects
$\beta_0$ [Cons]     -1.25 (1.2)     2.11 (0.22, 0.21)    -1.27 (1.2)     2.10 (0.22, 0.21)
$\beta_1$ [Lbas]      0.87 (0.14)    0.88 (0.13, 0.11)     0.87 (0.14)    0.89 (0.13, 0.11)
$\beta_2$ [Treat]    -0.91 (0.41)   -0.93 (0.40, 0.40)    -0.91 (0.41)    [illegible]
$\beta_3$ [LbasTrt]   0.33 (0.21)    0.34 (0.20, 0.20)     0.33 (0.21)    0.34 (0.20, 0.20)
$\beta_4$ [Lage]      0.47 (0.35)    0.48 (0.35, 0.30)     0.46 (0.36)    0.48 (0.35, 0.33)
$\beta_5$ [V4]       -0.16 (0.05)   -0.16 (0.05, 0.07)     -              -
$\beta_6$ [Visit]     -              -                    -0.26 (0.16)    [illegible]

Random effects
$\sqrt{\psi_{11}}$    0.53 (0.06)    0.50 (0.06, 0.06)     0.52 (0.06)    0.50 (0.06, 0.06)
$\sqrt{\psi_{22}}$    -              -                     0.74 (0.16)    0.73 (0.16, 0.16)
$\psi_{21}$           -              -                    -0.01 (0.03)    0.00 (0.09, 0.11)

Log-lik.                            -665.29                               [illegible]

† SE_r denotes 'robust' standard errors based on the sandwich estimator

Figure 11.1 shows growth curves of the predicted epilepsy counts over visits
for each subject by treatment group for Model IV together with observed
counts shown as circles. Here, the predicted counts are the posterior means of
the exponential of the linear predictor,

$$\mathrm{E}\left[\exp(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_{1j} + \zeta_{2j} z_{ij}) \mid \mathbf{y}_j, \mathbf{X}_j, \mathbf{z}_j\right].$$

As was pointed out in Section 7.8, we must integrate the above exponential
function with respect to the posterior distribution to obtain the expectation
and cannot simply plug in the empirical Bayes predictions $\tilde{\zeta}_{1j}$ and
$\tilde{\zeta}_{2j}$ in the exponential.
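The difference between the posterior mean of the exponential and the plug-in prediction can be illustrated numerically. The sketch below is a simplification and not the computation used for Figure 11.1: it assumes a random intercept only (no random slope), approximates the posterior on a grid, and uses hypothetical counts and parameter values.

```python
import math

def posterior_summaries(y, eta, psi, grid_size=2001, half_width=8.0):
    """Grid approximation to the posterior of a random intercept zeta.

    y: counts for one subject, eta: fixed parts of the linear predictors,
    psi: random intercept variance. Returns (E[zeta | y], E[exp(zeta) | y]).
    """
    sd = math.sqrt(psi)
    step = 2.0 * half_width * sd / (grid_size - 1)
    grid = [-half_width * sd + i * step for i in range(grid_size)]
    log_w = []
    for z in grid:
        log_post = -0.5 * z * z / psi  # normal prior kernel
        for y_i, eta_i in zip(y, eta):
            # Poisson log-likelihood up to a constant
            log_post += y_i * (eta_i + z) - math.exp(eta_i + z)
        log_w.append(log_post)
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    mean_zeta = sum(w_i * z for w_i, z in zip(w, grid)) / total
    mean_exp = sum(w_i * math.exp(z) for w_i, z in zip(w, grid)) / total
    return mean_zeta, mean_exp

# Hypothetical subject with four counts and a common fixed part of 1.0.
y, eta, psi = [3, 5, 2, 4], [1.0, 1.0, 1.0, 1.0], 0.25
zeta_tilde, exp_mean = posterior_summaries(y, eta, psi)
plug_in = math.exp(zeta_tilde)
print(exp_mean, plug_in)
```

Because the exponential is convex, the posterior mean of exp(zeta) exceeds exp of the posterior mean, so the plug-in prediction is too small.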

The variability in slopes is most apparent for subjects with larger counts. 
The observed counts of subject 227 deviate substantially from the predicted 
counts, particularly at visit 3. 


11.3.3 Model diagnostics 

We now consider model diagnostics of the kind discussed in Section 8.6 for
Model II which only included a random intercept.

Normality of the random effects can be assessed by estimating the model
with a nonparametric random intercept distribution. The NPML solution (not
mean-centered) has six masses at -30, 1.00, 1.75, 2.03, 2.37 and 2.90 with
probabilities 0.02, 0.15, 0.45, 0.22, 0.15 and 0.09, respectively (note that Yau
and Kuk (2002) only found four masses). The five-mass solution was a
boundary solution with a very large negative location for one of the classes, giving
an expected count of zero for that class. To avoid a very flat log-likelihood
and obtain the NPMLE, this location was fixed at -30. Figure 11.2 shows the
predicted counts (exponentials of the locations) when all the mean-centered
covariates are zero. The estimated distribution is highly asymmetric (also on
the log scale), suggesting that the normality assumption may be dubious.
However, the log-likelihood based on normality, -665.29, is not much lower
than that from NPMLE, -655.06, when taking into account that 9 extra
parameters are estimated.
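The expected counts implied by the NPML locations, and the size of the log-likelihood difference, can be computed directly. This is a descriptive sketch only: the variable names are illustrative, and no reference distribution is assumed for the likelihood ratio quantity.

```python
import math

# NPML locations as quoted in the text (not mean-centered).
locations = [-30.0, 1.00, 1.75, 2.03, 2.37, 2.90]

# Expected counts are the exponentials of the locations when all
# mean-centered covariates are zero.
expected_counts = [math.exp(e) for e in locations]

# Twice the difference in log-likelihoods, NPML versus normal,
# to be set against the 9 extra parameters.
loglik_normal, loglik_npml = -665.29, -655.06
twice_difference = 2.0 * (loglik_npml - loglik_normal)

print([round(c, 2) for c in expected_counts], round(twice_difference, 2))
```

The smallest and largest masses correspond to expected counts of essentially 0 and about 18.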

The existence of masses at predicted counts of 0 and 18 suggests that there
may be outlying subjects who could be influential. To assess influence we
therefore computed Cook's distances (see Section 8.6.6) for the model assuming a
normal random intercept. We also obtained standardized predictions of the
random intercepts, so-called standardized empirical Bayes residuals, using the
approximate sampling standard deviation in equation (7.8) on page 232 for
the standardization. The standardized residuals were computed using both
the parameter estimates $\hat{\theta}_{(-j)}$ when subject $j$ is deleted and the estimates
based on the full sample, $\hat{\theta}$.

Figure 11.2 Estimated probabilities $\pi_c$ and expected counts $\exp(e_c)$ for components
c = 1, ..., 5 when $\mathbf{x}_{ij} = \mathbf{0}$

These diagnostics are reported in Table 11.6 for the six subjects whose
Cook's distances exceed 1 as well as two subjects with lower distances but
with standardized residuals exceeding 2 in absolute value. For these subjects
we also give DFBETAS for [Treat], [V4] and the random intercept standard
deviation $\sqrt{\psi}$. These were obtained by actually deleting the subject and
re-estimating the parameters instead of using the approximate one-step method
described in Section 8.6.6. We also show the corresponding responses, the
pre-treatment seizure count divided by 4 [Base] and the treatment group [Treat].

Table 11.6 Influence statistics (Cook's D and DFBETAS) and various residuals

                                                      DFBETAS               Normal residual          NPML
                 Responses               Cook's                             using           using
Subj.  [Base]    (visits 1 to 4) [Treat]  D     [Treat]  [V4]  $\sqrt{\psi}$   $\hat{\theta}_{(-j)}$  $\hat{\theta}$   $\omega_1$  $\omega_6$

126    13.0       40  20  23  12    0     1.10   -0.02   0.52    0.02        1.04            0.89    0.00   0.00
135     2.5       14  13   6   0    0     1.52    0.39   0.40   -0.33        2.23            1.97    0.00   0.99
227    13.8       18  24  76  25    0     1.46   -0.14   0.39   -0.33        2.19            1.93    0.00   1.00
207    37.8      102  65  72  63    1     1.68    0.58   0.24   -0.16        1.97            1.37    0.00   1.00
225     5.5        1  23  19   8    1     1.05   -0.23   0.18   -0.44        2.47            2.26    0.00   1.00
232     3.3        0   0   0   0    1     1.57    0.34   0.00   -0.44       -2.92           -2.77    0.94   0.00
206    12         11   0   0   5    0     0.52    0.13  -0.08   -0.32       -2.10           -1.91    0.00   0.00
112     7.75      22  17  19  16    1     0.72   -0.03   0.02   -0.32        2.26            2.07    0.00   1.00

Since there were 59 subjects, the expected number of standardized residuals
exceeding an absolute value of 2.39 is about one if the residuals are standard
normally distributed. Therefore, subjects with standardized empirical Bayes
predictions exceeding this value may be considered outliers (shown in bold
for subjects 225 and 232). As expected, subjects 225 and 232 therefore also
have large DFBETAS for the random intercept standard deviation and [Treat]
(since [Treat] is a between-subject covariate). Subjects 135, 227, 206 and 112
are also possible outliers in the random intercept distribution and have
considerable influence on the estimate of the random intercept standard deviation.
Subject 207 has a relatively small residual and small influence on the
standard deviation although the responses are extremely high. This is probably
because the baseline count (the log of which is a covariate) was also
considerably higher than for subjects 225 and 232. Nevertheless, removing subject
207, who was in the treatment group, would substantially reduce the estimate
of the treatment effect as reflected in the large DFBETAS for [Treat].
Furthermore, removing subject 207 virtually eliminates the interaction effect
of [LbasTrt]. Subjects 126, 135 and 227 all have large influence on the estimate
of [V4]. This is because these subjects had a marked drop in epilepsy count
at the fourth visit.

For the NPMLE, we now consider the posterior probabilities of the smallest
and largest locations, denoted $\omega_1$ and $\omega_6$. All subjects in the table except 126
and 206 have posterior probabilities close to 1 of belonging to one of these
extreme classes; subjects 126 and 206 have posterior probabilities of 0.99 and
1.00 of belonging to classes 5 and 2, respectively. For subjects not shown in the
table, the largest value of $\omega_6$ was 0.03 and $\omega_1$ was zero for everyone. Simulations
could be used to attach p-values to the various measures of 'outlyingness'.
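The cutoff of 2.39 used for the standardized residuals can be checked with a short calculation. This is a minimal sketch assuming independent standard normal residuals, with illustrative function names.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2.0))

n_subjects = 59
expected_exceeding = n_subjects * two_sided_tail(2.39)
print(round(expected_exceeding, 3))
```

With 59 subjects the expected number of residuals beyond 2.39 in absolute value is close to one, which is the rationale given for treating such residuals as outliers.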


11.4 Lip cancer in Scotland: Disease mapping 

11.4.1 Introduction 

We now consider models for disease mapping or small area estimation. Clayton
and Kaldor (1987) presented and analyzed data on lip cancer for each of the
56 (pre-reorganization) counties of Scotland over the period 1975-1980. These
data³ have also been analyzed by Breslow and Clayton (1993) and Leyland
(2001) among many others. The number of observed lip cancer cases, the
expected number of cases and crude standardized mortality ratios (SMR) are
presented in Table 11.7.

Table 11.7 Observed and expected numbers of lip cancer cases and
various SMR estimates (in percent) for Scottish counties

                                                     Predicted SMRs
                       Obs    Exp    Crude                     Spatial IGAR
County           #     o_j    e_j    SMR     Norm.    NPML     Est      95% CI

Skye.Lochalsh    1      9     1.4    652.2   470.8    342.6    412.3    305.5, 492.2
Banff.Buchan     2     39     8.7    450.3   421.8    362.4    430.4    408.4, 444.3
Caithness        3     11     3.0    361.8   309.4    327.1    351.4    306.7, 394.9
Berwickshire     4      9     2.5    355.7   295.2    321.6    230.3    162.9, 281.8
Ross.Cromarty    5     15     4.3    352.1   308.5    327.6    321.8    277.2, 357.3
Orkney           6      8     2.4    333.3   272.1    311.1    332.4    283.4, 381.9
Moray            7     26     8.1    320.6   299.9    322.2    303.2    275.9, 324.4
Shetland         8      7     2.3    304.3   247.8    292.5    311.1    274.5, 353.5
Lochaber         9      6     2.0    303.0   238.9    280.1    231.2    190.1, 271.0
Gordon          10     20     6.6    301.7   279.1    319.9    285.5    261.3, 304.4
W.Isles         11     13     4.4    295.5   262.5    315.5    299.1    264.7, 335.6
Sutherland      12      5     1.8    279.3   219.2    254.3    304.1    261.7, 354.9
Nairn           13      3     1.1    277.8   198.4    222.7    266.2    215.0, 317.0
Wigtown         14      8     3.3    241.7   210.9    249.6    159.7    115.0, 194.6
NE.Fife         15     17     7.8    216.8   204.6    245.3    173.2    140.5, 194.0
Kincardine      16      9     4.6    197.8   178.9    171.4    190.9    167.9, 214.5
Badenoch        17      2     1.1    186.9   151.9    163.2    207.7    169.7, 247.8
Ettrick         18      7     4.2    167.5   154.7    136.7    126.9    102.9, 153.5
Inverness       19      9     5.5    162.7   154.2    128.4    210.4    179.7, 252.3
Roxburgh        20      7     4.4    157.7   149.0    130.3    145.0    120.2, 172.6
Angus           21     16    10.5    153.0   147.8    117.1    138.8    125.2, 150.1
Aberdeen        22     31    22.7    136.7   135.0    116.4    145.0    139.0, 156.8
Argyll.Bute     23     11     8.8    125.4   123.3    116.4    111.5     96.5, 126.1
Clydesdale      24      7     5.6    124.6   122.9    116.7     79.7     63.7,  95.0
Kirkcaldy       25     19    15.5    122.8   121.6    116.4    123.9    115.1, 132.1
Dunfermline     26     15    12.5    120.1   119.1    116.3    108.4     94.6, 118.8
Nithsdale       27      7     6.0    115.9   116.3    115.5     99.8     79.9, 115.2
E.Lothian       28     10     9.0    111.6   111.4    115.9    108.7     94.7, 123.6
Perth.Kinross   29     16    14.4    111.3   111.1    116.3    116.9    105.4, 129.5
W.Lothian       30     11    10.2    107.8   108.5    115.9     78.9     64.9,  91.1
Cumnock-Doon    31      5     4.8    105.3   107.2    111.5     91.2     74.0, 108.4
Stewartry       32      3     2.9    104.2   109.1    109.4    107.4     82.6, 131.7
Midlothian      33      7     7.0     99.6   102.7    113.1     90.4     75.1, 104.3
Stirling        34      8     8.5     93.8    97.3    112.9     74.2     60.8,  85.7
Kyle.Carrick    35     11    12.3     89.3    92.2    114.1     91.1     79.1, 104.5
Inverclyde      36      9    10.1     89.1    92.5    112.4     84.4     73.7,  95.7
Cunninghame     37     11    12.7     86.8    89.7    113.3     81.9     71.4,  90.9
Monklands       38      8     9.4     85.6    89.5    109.4     63.2     50.1,  74.1
Dumbarton       39      6     7.2     83.3    89.4    104.9     78.1     66.0,  90.8
Clydebank       40      4     5.3     75.9    85.6     94.5     61.5     50.4,  73.0
Renfrew         41     10    18.8     53.3    59.1     40.6     56.6     49.6,  64.8
Falkirk         42      8    15.8     50.7    57.9     40.9     61.8     53.8,  70.7
Clackmannan     43      2     4.3     46.3    68.8     65.5     75.1     63.0,  89.8
Motherwell      44      6    14.6     41.0    50.6     37.5     50.1     43.5,  57.6
Edinburgh       45     19    50.7     37.5    40.8     36.2     46.6     41.8,  55.9
Kilmarnock      46      3     8.2     36.6    53.2     42.2     56.7     47.5,  67.6
E.Kilbride      47      2     5.6     35.8    57.9     49.7     51.2     40.6,  61.5
Hamilton        48      3     9.3     32.1    48.5     38.8     45.4     38.3,  54.8
Glasgow         49     28    88.7     31.6    33.8     36.2     37.0     33.8,  41.2
Dundee          50      6    19.6     30.6    39.8     36.2     57.4     46.3,  77.9
Cumbernauld     51      1     3.4     29.1    63.1     57.8     54.0     43.4,  66.8
Bearsden        52      1     3.6     27.6    61.1     55.4     48.3     37.9,  60.3
Eastwood        53      1     5.7     17.4    46.4     40.6     42.2     34.5,  52.7
Strathkelvin    54      1     7.0     14.2    40.8     37.8     43.2     33.9,  53.6
Tweeddale       55      0     4.2      0.0    43.2     40.8     73.0     55.1,  97.0
Annandale       56      0     1.8      0.0    64.9     60.6     71.3     56.0,  87.1

Source: Clayton and Kaldor (1987)

³ The data can be downloaded from gllamm.org/books

The expected number of lip cancer cases is based on the age-specific lip
cancer rates for Scotland and the age distribution of the counties. The SMR
for a county is defined as the ratio of the mortality rate to that expected if
the age-specific mortality rates were equal to those of a reference population
(e.g. Breslow and Day, 1987). The crude estimate of the SMR for county $j$ is
obtained using

$$\mathrm{SMR}_j = \frac{o_j}{e_j},$$

where $o_j$ is the observed number of cases and $e_j$ the expected number. These
estimates are shown in Table 11.7 under 'Crude SMR' and a map of the crude
SMRs is shown in Figure 11.3.



Figure 11.3 Crude SMR in percent 
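The crude SMR calculation is a simple ratio. The sketch below uses the observed and expected counts for three counties as printed in Table 11.7; since the printed expected counts are rounded to one decimal, the results differ slightly from the tabulated crude SMRs. The function name is illustrative.

```python
def crude_smr_percent(observed, expected):
    """Crude standardized mortality ratio in percent."""
    return 100.0 * observed / expected

# (observed, expected) as printed for three counties
counties = {
    "Skye.Lochalsh": (9, 1.4),
    "Banff.Buchan": (39, 8.7),
    "Kincardine": (9, 4.6),
}

smr = {name: crude_smr_percent(o, e) for name, (o, e) in counties.items()}
for name, value in smr.items():
    print(f"{name:15s} {value:6.1f}")
```

Skye.Lochalsh illustrates the imprecision for small counties: with an expected count of about 1.4, a change of a single observed case moves the crude SMR by roughly 70 percentage points.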

There are two important limitations of crude SMRs. First, crude SMR
estimates for counties with small populations are very imprecise. Second, crude
SMRs do not take into account that geographically close areas tend to have

where $\ln(e_j)$ is an offset (a covariate with regression coefficient set to 1), and $\zeta_j$
is a random intercept representing unobserved heterogeneity between counties.
The empirical Bayes predictions of the SMRs,

$$\widetilde{\mathrm{SMR}}_j = \mathrm{E}\left[\exp(\beta_0 + \zeta_j) \mid o_j, e_j\right],$$

will then be shrunken towards the average SMR, providing more stable values
for counties with smaller populations.
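The shrinkage can be illustrated for a single county. The following is a sketch under stated assumptions rather than the computation used in the text: the posterior is approximated on a grid, the estimates 0.08 and 0.76 are treated as the intercept and the random intercept standard deviation, and the function name is hypothetical.

```python
import math

def eb_smr(observed, expected, beta0, psi_sd, grid_size=2001, half_width=8.0):
    """Posterior mean of exp(beta0 + zeta) given one county's count,
    with zeta ~ N(0, psi_sd**2) and a Poisson response (grid approximation)."""
    step = 2.0 * half_width * psi_sd / (grid_size - 1)
    log_w, rel_risks = [], []
    for i in range(grid_size):
        z = -half_width * psi_sd + i * step
        rel_risk = math.exp(beta0 + z)
        mu = expected * rel_risk
        # normal prior kernel plus Poisson log-likelihood up to constants
        log_w.append(-0.5 * (z / psi_sd) ** 2 + observed * math.log(mu) - mu)
        rel_risks.append(rel_risk)
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    return sum(w_i * r for w_i, r in zip(w, rel_risks)) / sum(w)

# Skye.Lochalsh: 9 observed cases, about 1.4 expected.
crude = 9 / 1.4
shrunken = eb_smr(observed=9, expected=1.4, beta0=0.08, psi_sd=0.76)
print(round(100 * crude, 1), round(100 * shrunken, 1))
```

The prediction falls between the overall level exp(0.08) and the crude ratio, and the pull towards the overall level is stronger the smaller the expected count.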

The parameter estimates assuming $\zeta_j \sim \mathrm{N}(0, \psi)$ are given in Table 11.8
under 'Indep. Normal'. Empirical Bayes predictions of the SMRs assuming
$\zeta_j \sim \mathrm{N}(0, \psi)$ are given under 'Norm.' in Table 11.7 and displayed as a map in
Figure 11.4.


Table 11.8 Estimates for different random intercept models for Scottish lip cancer
data

                     Indep.         Indep.         Spatial
                     Normal         NPML           IGAR
                     Est (SE)       Est (SE)       Est (SE)

$\beta_0$ [Cons]     0.08 (0.12)    0.08 (0.12)    0.09 (0.07)
$\sqrt{\psi}$        0.76 (0.10)    0.80† (-)      0.74 (0.14)

† derived from discrete distribution


Instead of assuming that the random intercept is normally distributed, we
also used nonparametric maximum likelihood estimation (NPMLE) where $\zeta_j$
is discrete with locations $\zeta_j = e_c$, $c = 1, \ldots, C$, and probabilities $\pi_c$, and
$C$ is determined to maximize the likelihood (see Section 6.5). The NPML
solution had $C = 4$ masses with locations $\beta_0 + e_c$ equal to -1.02, 0.15, 1.13
and 1.35 (corresponding to SMRs of 36.2%, 116.4%, 308.8% and 384.8%)
and estimated probabilities $\pi_c$ equal to 27.5%, 47.9%, 18.5% and 6.2%. The
parameter estimates are shown in Table 11.8 under 'Indep. NPML'. Predicted
SMRs based on these NPML estimates are given under 'NPML' in Table 11.7.
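The summaries of the discrete distribution can be recomputed from the rounded locations and probabilities quoted above. This is a sketch with illustrative variable names; small discrepancies from the printed SMRs arise because the quoted locations are rounded to two decimals.

```python
import math

# Locations beta0 + e_c and probabilities pi_c as quoted (rounded).
locations = [-1.02, 0.15, 1.13, 1.35]
probabilities = [0.275, 0.479, 0.185, 0.062]

# Renormalize because the rounded probabilities sum to 1.001.
total = sum(probabilities)
weights = [p / total for p in probabilities]

# SMR (in percent) for each mass.
smr_percent = [100.0 * math.exp(x) for x in locations]

# Mean and standard deviation of the discrete distribution on the log scale.
mean_location = sum(w * x for w, x in zip(weights, locations))
variance = sum(w * (x - mean_location) ** 2 for w, x in zip(weights, locations))
sd_location = math.sqrt(variance)

print([round(s, 1) for s in smr_percent], round(mean_location, 2), round(sd_location, 2))
```

The mean and standard deviation come out at about 0.08 and 0.80, in line with the 'Indep. NPML' entries of Table 11.8 that are derived from the discrete distribution.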

11.4.3 Spatially correlated random intercepts 

It is likely that unobserved risk factors, many of which may be
environmental, are spatially correlated. We will therefore now consider spatial models
where the random effects $\zeta_j$ are allowed to be correlated across neighboring
counties. Clayton and Kaldor (1987) considered a conditional Gaussian
autoregressive model (Besag, 1974), but this specification is problematic for the
case of so-called irregular maps where the number of neighbors varies. For this
common situation the intrinsic autoregressive Gaussian (IGAR) model (Besag
et al., 1991; Bernardinelli and Montomoli, 1992) is deemed more appropriate.
In this model the conditional distribution of a random intercept for a county,
given the random intercepts of its contiguous neighbors, does not depend
on the random intercepts of non-neighboring counties. Such models are known
as Markov random fields.

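The model can be specified as

$$h(\boldsymbol{\zeta}) \propto \exp\Big[-\sum_{i \sim j} (\zeta_j - \zeta_i)^2 / (2\psi)\Big],$$

where $i \sim j$ indexes counties contiguous to county $j$. Here the conditional
expectation of $\zeta_j$ is the mean of $\zeta_i$ in the neighboring counties,

$$\mathrm{E}(\zeta_j \mid \zeta_i, i \neq j) = \frac{1}{m_j} \sum_{i \sim j} \zeta_i.$$

Note that the unconditional expectation of $\zeta_j$ is not specified. The conditional
variance is inversely proportional to their number $m_j$,

$$\mathrm{Var}(\zeta_j \mid \zeta_i, i \neq j) = \frac{\psi}{m_j}.$$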
The intrinsic autoregressive Gaussian model has been estimated for the lip
cancer data using PQL-1 (Breslow and Clayton, 1993) and MCMC
(Spiegelhalter et al., 1996a). Since the random effects for all counties are correlated
with each other (except for islands), methods based on numerical integration
are not feasible due to the excessive dimensionality of the integrals. We
therefore propose the following iterative algorithm which strongly resembles the
AIP algorithm described in Section 6.11.5:

• Set initial offsets to zero,

$$a_j^0 = 0.$$

• In iteration $k$:

1. Estimate a Poisson random intercept model with linear predictor

$$\nu_j = \ln(e_j) + \beta_0 + a_j^k + u_j \frac{1}{\sqrt{m_j}}, \qquad u_j \sim \mathrm{N}(0, \psi),$$

giving parameter estimates $\hat{\theta}^k$ with estimated covariance matrix $\mathrm{Cov}(\hat{\theta})^k$.

2. Sample parameters $\theta^k$ from their approximate sampling distribution,

$$\theta^k \sim \mathrm{N}(\hat{\theta}^k, \mathrm{Cov}(\hat{\theta})^k).$$

3. Obtain the posterior means and standard deviations of the random
effects given the parameters $\theta^k$,

$$\tilde{u}_j^k = \mathrm{E}(u_j \mid y_j, \theta^k),$$

$$\tau_j^k = \sqrt{\mathrm{Var}(u_j \mid y_j, \theta^k)}.$$

4. Sample the random effects from their approximate posterior distribution

$$u_j^k \sim \mathrm{N}(\tilde{u}_j^k, (\tau_j^k)^2).$$

5. Obtain $\zeta_j^k$ as

$$\zeta_j^k = a_j^k + u_j^k \frac{1}{\sqrt{m_j}}.$$

6. Calculate the mean of the neighbors' values of $\zeta_i^k$,

$$\bar{\zeta}_j^k = \frac{1}{m_j} \sum_{i \sim j} \zeta_i^k.$$

7. Update the offsets to

$$a_j^{k+1} = \bar{\zeta}_j^k - \frac{1}{J} \sum_{i=1}^{J} \bar{\zeta}_i^k.$$

Here the offset $a_j^{k+1}$ represents the mean of the random effect of cluster
$j$ for the current iteration, computed from the means of the neighbors'
random effects from the previous iterations. (The mean-centering in step 7 is
required to identify the intercept parameter $\beta_0$ since the intrinsic
autoregressive Gaussian model does not specify a mean for the random effects.) In step
1, $\zeta_j = a_j + u_j \frac{1}{\sqrt{m_j}}$ has mean $a_j$ and variance $\psi/m_j$ as required. The random
effect $u_j$ is independently distributed, and the model can therefore be
estimated using any software for random intercept models, here using adaptive
quadrature in gllamm. In step 2, we sample from the approximate sampling
distribution of the parameters to reflect the uncertainty of estimation. Hence,
our approach is more elaborate than conventional empirical Bayes where
parameter estimates are plugged in and this uncertainty is ignored. In steps 3
and 4, we sample the random effects from the normal approximation to the
posterior distribution. In step 5 we form the corresponding $\zeta_j^k$ and combine
them in step 6 to form the required offsets for the next iteration. In step 7
the offsets are mean-centered to allow estimation of the intercept $\beta_0$.
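Steps 5 to 7 involve only simple arithmetic on the sampled random effects and can be sketched as follows. This is not the gllamm implementation: the model fitting and sampling in steps 1 to 4 are omitted, the function name is hypothetical, and the four-area map and sampled values are made up for illustration.

```python
import math

def update_offsets(offsets, u_draws, neighbors):
    """One pass through steps 5 to 7 given sampled random effects u_draws."""
    n_areas = len(offsets)
    # Step 5: zeta_j = a_j + u_j / sqrt(m_j)
    zeta = [offsets[j] + u_draws[j] / math.sqrt(len(neighbors[j]))
            for j in range(n_areas)]
    # Step 6: mean of the neighbors' values of zeta
    neighbor_mean = [sum(zeta[i] for i in neighbors[j]) / len(neighbors[j])
                     for j in range(n_areas)]
    # Step 7: mean-center so that the intercept remains identified
    overall = sum(neighbor_mean) / n_areas
    return [value - overall for value in neighbor_mean]

# Hypothetical map and sampled random effects for the first iteration.
neighbors = [[1], [0, 2, 3], [1, 3], [1, 2]]
offsets = [0.0, 0.0, 0.0, 0.0]
u_draws = [0.5, -0.3, 0.2, 0.1]

new_offsets = update_offsets(offsets, u_draws, neighbors)
print([round(a, 4) for a in new_offsets])
```

The returned offsets sum to zero by construction, so they can be passed to the next iteration's random intercept model alongside a freely estimated intercept.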

The algorithm was repeated until the estimates $\hat{\theta}^k$ appear to come from a
stationary distribution (the 'burn-in' period). This took only about 10
iterations. The algorithm was then repeated 500 times to obtain the mean and
variance of the estimates as described in Section 6.11.5. The resulting
estimates are shown in Table 11.8.

In step 3, the posterior mean SMRs were also obtained. Their means and
95% confidence intervals based on the 2.5 and 97.5 percentiles of the 500
replications are given in the last three columns of Table 11.7. The point
estimates are shown in the map in Figure 11.5 and the confidence intervals in
Figure 11.6. Here the hollow circles are the crude SMR estimates and the
vertical line is the estimate of the mean SMR, $100\exp(\hat{\beta}_0)$, using the intrinsic
Gaussian autoregressive model.

11.4.4 Introducing a county-level covariate

Breslow and Clayton (1993) suggest investigating the effect of the county-level
covariate [Agric], the percent of the work force in each county employed in
agriculture, fishing or forestry divided by ten. [Agric] is believed to have an
effect on lip cancer incidence since all three occupations involve outdoor work
and exposure to sunlight is known to be the main risk factor for lip cancer
(Kemp et al., 1985). Following Breslow and Clayton (1993) we consider the
model

$$\ln(\mu_j) = \ln(e_j) + \beta_0 + \beta_1 x_j + \zeta_j,$$

where [Agric] is represented by $x_j$.

Figure 11.5 SMRs and 95% CI based on intrinsic autoregressive Gaussian model
without covariate

Estimates for an intrinsic Gaussian autoregressive model including the effect
of [Agric] using the data augmentation algorithm are given in Table 11.9. The
estimates are very close to those using PQL-1 (Breslow and Clayton, 1993)
and MCMC (from the BUGS Examples Manual (Spiegelhalter et al., 1996c,
Volume 2, Chapter 11)), which are also given in the table. The priors for the
Bayesian model were $\beta_1 \sim \mathrm{N}(0, 10^5)$ and $\zeta_j \sim \mathrm{N}(0, \psi)$ and the hyperprior was
$\psi \sim \mathrm{IG}(0.001, 0.001)$; see page 210 for the inverse gamma (IG). No constant
was included for the MCMC estimates because Spiegelhalter et al. (1996c)
did not set the mean of the random effects to zero. Although there appears
to be an effect of [Agric], conclusions regarding etiology should be made with
extreme caution when based on ecological or aggregated data such as these.

In Figure 11.7 we display a map of SMRs based on the intrinsic
Gaussian autoregressive model with [Agric], estimated by the data augmentation
algorithm.


Table 11.9 Estimates for intrinsic Gaussian autoregressive model with covariate for
Scottish lip cancer data

                                                    Data
                      PQL-1          MCMC           augment.
                      Est (SE)       Est (SE)       Est (SE)

$\beta_0$ [Cons]     -0.18 (0.12)    -             -0.21 (0.11)
$\beta_1$ [Agric]     0.35 (0.12)    0.37 (0.11)    0.36 (0.11)
$\sqrt{\psi}$         0.73 (0.13)    0.69 (0.12)    0.65 (0.14)
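The [Agric] coefficient can be expressed as a relative risk. The sketch below uses the data augmentation estimate 0.36 with standard error 0.11; the normal-approximation interval is an added illustration and is not reported in the text.

```python
import math

beta_agric, se_agric = 0.36, 0.11

# [Agric] is the percentage divided by ten, so exp(beta) is the ratio of
# SMRs for a ten percentage point difference in the workforce share.
relative_risk = math.exp(beta_agric)

# Assumed Wald-type 95% interval on the log scale, then exponentiated.
lower = math.exp(beta_agric - 1.96 * se_agric)
upper = math.exp(beta_agric + 1.96 * se_agric)

print(round(relative_risk, 2), round(lower, 2), round(upper, 2))
```

As noted in the text, an association of this kind in aggregated data should not be read as evidence about individual-level etiology.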


11.5 Summary and further reading 

We have addressed the problem of overdispersion, using zero-inflated models as 
well as random intercept models with both normal and nonparametric random 
effects distributions. Notwithstanding their genesis in the statistical literature 
(Lambert, 1992), recent research and applications of models for zero-inflated 
count data have taken place mostly in econometrics (see Cameron and Trivedi 
(1998) for references). Hall (1997) and Dobbie and Welsh (2001) discuss zero- 
inflated Poisson models for clustered data. 

We then used random intercept and random coefficient models for longi¬ 
tudinal epileptic seizure data and used diagnostics to check model assump¬ 
tions. Comprehensive treatments of various types of count data modeling, 
also including random effects, are provided in the econometric monographs of 
Winkelmann (2003) and Cameron and Trivedi (1998). 

Various models were used for mapping of lip cancer rates, including one with
spatially correlated random effects. An alternative model for spatial
dependence is a conditional autoregressive model which conditions on the observed
counts in neighboring areas. Biggeri et al. (2000) include random effects in
such a model and use nonparametric maximum likelihood estimation.
Knorr-Held and Best (2001) propose a Bayesian model for two diseases where the
response model is essentially a common factor model. Application of small
sing intrinsic autoregressive Gaussian model with [Agric] as


[mapping seems to be most common in medical research, 
sease rates as the archetype. However, other applications 
unemployment, crime rates and party preference. A nice 
Pfeffermann (2002). Lawson et al. (2003) discuss disease 
MLwiN and WinBUGS softwares. 

[factor models in this chapter. For a discussion and appli- 
dels for counts we refer to Wedel et al. (2003). Arminger 
I consider a factor model for mixed responses including 

ipplication involving counts in Section 14.5 of our chapter 
es and mixed responses. Specifically, we model the effect 



of physician advice on the number of alcoholic beverages consumed, taking 
unobserved heterogeneity and endogeneity into account. 
CHAPTER 12


Durations and survival 


12.1 Introduction 

In Section 2.5 we considered models for durations, distinguishing between
continuous and discrete-time durations. In this chapter we discuss clustered
survival or duration data often referred to as ‘multivariate’ duration data. It
is useful to distinguish between ‘single events’ and ‘multiple events’ clustered
duration data.

Single events clustered duration data comprise durations for one event (e.g. 
death from lung cancer) for each unit (e.g. subject) in different clusters (e.g. 
families). It is typically assumed that the event is ‘absorbing’ (can only occur 
once) and that the hazard for a unit is independent of the timing of events for 
other units in the cluster, although the hazards are dependent among units in 
a cluster. 

In contrast, multiple events clustered duration data comprise durations for 
several events per cluster (e.g. subject). In this case it may be reasonable 
to expect that the hazard for one of the events for a particular subject could 
depend on the timing of other events for the subject. If the events are different 
and absorbing (e.g. death from different kinds of cancer) the events are of 
multiple types. The multiple events are called recurrent if the same event (e.g. 
occurrence and subsequent recurrence of colon cancer) is observed repeatedly. 
Complex duration data may of course include recurrent events of multiple 
types. 

Various aspects of model specification for multiple events clustered duration 
data are discussed in Section 12.2. In Section 12.3 we apply models for discrete 
time single events clustered duration data. The data are on children clustered 
in schools and the event of interest is first experimentation with cigarettes. 
In Section 12.4 we apply models for multiple events clustered survival data in 
continuous time. A randomized clinical trial is considered where the responses 
are times to onset of angina in repeated exercise tests. Continuous random 
effects and factors are considered as well as discrete latent variables. 


12.2 Modeling multiple events clustered duration data 

The modeling of single events clustered duration data involves fairly
straightforward inclusion of latent variables such as random effects and factors
(possibly varying at several levels) in the duration models discussed in
Section 2.5. For multiple events clustered survival data, however, a number of
additional considerations are important in choosing an appropriate model:

1. Are the events of different types? 

2. Are the events ordered so that a unit becomes at risk for the next event 
only if the previous event has occurred? An example is HIV infection and 
AIDS. A less clear-cut example is malaria where the biological processes 
leading to the first infection are very different from subsequent recurrences. 

3. Does the risk of an event depend on whether another event has already
occurred for the same unit (‘state dependence’)?

4. Do the risks of all events evolve in parallel from a common time origin 
or do they evolve sequentially, starting from the previous event-time? For 
instance, risks of different types of infection may evolve from a common 
origin such as time of surgery. In contrast, for durations from repeated 
experiments such as time from starting exercise (after resting) to developing 
angina the risk would be expected to evolve from the beginning of each 
experiment (see Section 12.4 for an application). 

Recall from Section 2.5 that both the proportional hazards model for
continuous time durations and the discrete-time hazards models can be formulated
in terms of risk sets. A difficulty with recurrent event data is that there are
many different ways of constructing these risk sets; which are appropriate is
determined by the considerations listed above.

Kelly and Lim (2000) suggest that the following aspects are useful to
consider when constructing risk sets:

• Risk interval formulations 

- Total time: the clock continues to run from start of observation,
undisturbed by event occurrences.

- Counting process: the clock continues to run but a unit becomes at risk
for its kth event only after having experienced the (k-1)th event.

- Gap time: the clock is reset to zero at each event-time for a unit.

The difference between risk intervals in terms of total time, counting process
and gap time is best illustrated using a simple example. Consider three
subjects j = A, B, C with possibly recurrent durations $t_{ij}$, from start of
observation to the ith event ($\delta_{ij} = 1$) or censoring ($\delta_{ij} = 0$). The
dataset is presented in Table 12.1, where we see for instance that A
experiences the first event at time 2, the second at time 5 and is eventually
censored at time 14. Risk intervals for the data are presented in Figure 12.1
for each of the three formulations.

The choice of risk interval formulation is guided by consideration 4 in the 
above list. Total time or the counting process formulation will typically be 
used if the risks evolve from a common origin and gap time used if the risks 
evolve sequentially. 
j    i    $t_{ij}$    $\delta_{ij}$

A    1     2     1
A    2     5     1
A    3    14     0
B    1     7     1
B    2    12     1
B    3    17     0
C    1    14     1


• Baseline hazard (within-unit stratification)

- Common baseline hazard: the same baseline hazard is assumed for all
recurrent events

- Event-specific baseline hazard: a different baseline hazard $h_i^0(t_{ij})$ is
specified for each event i

The choice of baseline hazard is guided by consideration 1 in the list above.
A common baseline hazard is typically specified if the events are of the same
type whereas event-specific baseline hazards are specified if the events are
of different types.

• Risk set for the kth event (between-unit stratification)

- Restricted: contributions to the kth risk set are restricted to units having
experienced k-1 events

- Unrestricted: risk sets include all units still at risk regardless of how
many events they have previously experienced

- Semi-restricted: contributions to the kth risk set are restricted to units
having experienced fewer than k events

The choice is guided by consideration 2 in the above list (whether a unit is
at risk for the next event only if the previous event has occurred).

Table 12.2 gives combinations of risk intervals, risk sets and baseline hazards 
and key references for continuous time proportional hazards modeling. 
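
The three risk interval formulations can be sketched in a few lines of Python
applied to the data in Table 12.1. This is only an illustration under stated
assumptions: the function name `risk_intervals` and the (start, stop, event)
representation are ours, not taken from Kelly and Lim (2000), and each interval
is read as (start, stop].

```python
from collections import defaultdict

# Table 12.1: (subject j, event number i, time t_ij, event indicator delta_ij)
TABLE_12_1 = [
    ("A", 1, 2, 1), ("A", 2, 5, 1), ("A", 3, 14, 0),
    ("B", 1, 7, 1), ("B", 2, 12, 1), ("B", 3, 17, 0),
    ("C", 1, 14, 1),
]


def risk_intervals(records, formulation):
    """Return {subject: [(start, stop, delta), ...]} for one formulation.

    formulation is 'total', 'counting' or 'gap' (hypothetical labels).
    """
    by_subject = defaultdict(list)
    for subject, _, time, delta in sorted(records):
        by_subject[subject].append((time, delta))

    result = {}
    for subject, spells in by_subject.items():
        previous = 0  # time of the previous event (start of observation is 0)
        intervals = []
        for time, delta in spells:
            if formulation == "total":
                # clock runs from the start of observation
                intervals.append((0, time, delta))
            elif formulation == "counting":
                # at risk only after the previous event
                intervals.append((previous, time, delta))
            elif formulation == "gap":
                # clock reset to zero at each event
                intervals.append((0, time - previous, delta))
            else:
                raise ValueError("unknown formulation: " + formulation)
            previous = time
        result[subject] = intervals
    return result


for name in ("total", "counting", "gap"):
    print(name, risk_intervals(TABLE_12_1, name)["A"])
```

For subject A the gap time intervals have lengths 2, 3 and 9, whereas the
counting process intervals are (0, 2], (2, 5] and (5, 14].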

12.3 Onset of smoking: Discrete time frailty models 

12.3.1 Introduction 

We will analyze data on the smoking behavior of school children (Lader
and Matheson, 1991) previously analyzed by Pickles et al. (2001) and
Rabe-Hesketh et al. (2001d). Two cross-sectional surveys of school children aged
11 to 15 years were carried out in 1990 and 1993, giving a time-sequential
design (e.g. Appelbaum and McCall, 1983). Both studies followed similar
two-stage sampling designs with schools as primary sampling units. The 1990
sample includes 3124 pupils from 125 schools and the 1993 sample includes 3140
different children from 110 schools. We assume that the sampling fraction for
schools is sufficiently low so that the possibility of the same schools
appearing in both samples can be ignored. The 5 years of classes sampled within
each survey and the 3-year interval between surveys resulted in some age
cohorts being sampled twice (e.g. 11-year-olds in 1990 are the same cohort as
14-year-olds in 1993).

Figure 12.1 Illustration of risk intervals for total time, counting process and gap
time (Adapted from Kelly and Lim, 2000)

The children were asked whether they had ever smoked a cigarette, and if
so, how old they were the first time they smoked. In 2% of observations the
child did not remember the age of onset (left censoring) and these observations
were discarded. The children were surveyed after different periods of time since
their last birthdays, giving those children interviewed immediately after their
birthday little opportunity to have had a first cigarette at their current age.
We will therefore analyze the data as if we had interviewed the children at
the last birthday, treating the age of first experimentation as right-censored
if it equals the current age.

Table 12.2 Combinations of types of risk interval, risk set and baseline hazards with
key references

                                    Risk set / baseline hazard
Risk interval       Unrestricted /          Semi-restricted /       Restricted /
                    common                  event-specific          event-specific

Total time          Lee, Wei &              Wei, Lin &
                    Amato (1992)            Weissfeld (1989)

Counting process    Andersen &                                      Prentice, Williams
                    Gill (1982)                                     & Peterson (1981)

Gap time                                    Not possible            Prentice, Williams
                                                                    & Peterson (1981)

Source: Kelly and Lim (2000)

There are six possible discrete-time durations y = s corresponding to the
possible age-ranges of onset, $\tau_{s-1} \le T < \tau_s$, s = 1,...,6:

• s = 1: Before age 11, $\tau_0 = 0$, $\tau_1 = 11$

• s = 2: Age 11, $\tau_1 = 11$, $\tau_2 = 12$

• s = 3: Age 12, $\tau_2 = 12$, $\tau_3 = 13$

• s = 4: Age 13, $\tau_3 = 13$, $\tau_4 = 14$

• s = 5: Age 14, $\tau_4 = 14$, $\tau_5 = 15$

• s = 6: Age 15 and above, $\tau_5 = 15$, $\tau_6 = \infty$ (always censored)

Note that s = 6 is always censored because 15 was the oldest age and age of
onset at the current age is treated as right-censored. The setup is presented
in Table 12.3.
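
The coding of the discrete durations described above can be sketched as
follows. This is a minimal illustration, assuming that ages are recorded in
whole years; the function name `discrete_duration` and the convention of
returning a pair (s, delta) are ours and not from the original analysis.

```python
def discrete_duration(current_age, onset_age=None):
    """Return (s, delta) for a child aged 11 to 15 at the last birthday.

    s is the interval index (1 to 6) and delta is 1 if first experimentation
    was observed in interval s, 0 if the duration is right-censored in s.
    onset_age=None means that the child has never smoked.
    """
    if not 11 <= current_age <= 15:
        raise ValueError("current_age must be between 11 and 15")
    if onset_age is not None and onset_age < current_age:
        # onset before the last birthday: interval 1 is 'before age 11',
        # interval 2 is age 11, and so on
        s = 1 if onset_age < 11 else onset_age - 9
        return s, 1
    # never smoked, or onset at the current age: treated as right-censored
    # in the interval corresponding to the current age
    return current_age - 9, 0


print(discrete_duration(13, 11))   # onset at age 11, observed
print(discrete_duration(12, 12))   # onset at current age, censored
print(discrete_duration(15))       # never smoked, censored in s = 6
```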

The explanatory variables that we consider as possible influences on age of 
first experimentation with smoking are: 

• [Girl] a dummy for the pupil being a girl rather than a boy

• [Boy] a dummy for the pupil being a boy rather than a girl

• [Cohort2] a dummy for second versus first cohort 

• [Parsmoke] a dummy for presence of a smoking parent at home. 
Table 12.3 Ages, possible times of onset and associated probabilities

(a) Proportional Odds

Age          11              12              13              14              15

Possible     T < 11          T < 11          T < 11          T < 11          T < 11
times        (P_1)           (P_1)           (P_1)           (P_1)           (P_1)
of onset
             11 ≤ T          11 ≤ T < 12     11 ≤ T < 12     11 ≤ T < 12     11 ≤ T < 12
             (1 - P_1)       (P_2 - P_1)     (P_2 - P_1)     (P_2 - P_1)     (P_2 - P_1)

             -               12 ≤ T          12 ≤ T < 13     12 ≤ T < 13     12 ≤ T < 13
             -               (1 - P_2)       (P_3 - P_2)     (P_3 - P_2)     (P_3 - P_2)

             -               -               13 ≤ T          13 ≤ T < 14     13 ≤ T < 14
             -               -               (1 - P_3)       (P_4 - P_3)     (P_4 - P_3)

             -               -               -               14 ≤ T          14 ≤ T < 15
             -               -               -               (1 - P_4)       (P_5 - P_4)

             -               -               -               -               15 ≤ T
             -               -               -               -               (1 - P_5)

(b) Current Status

Age          11              12              13              14              15

             T < 11          T < 12          T < 13          T < 14          T < 15
             (P_1)           (P_2)           (P_3)           (P_4)           (P_5)

             11 ≤ T          12 ≤ T          13 ≤ T          14 ≤ T          15 ≤ T
             (1 - P_1)       (1 - P_2)       (1 - P_3)       (1 - P_4)       (1 - P_5)

Source: Rabe-Hesketh et al. (2001d)


12.3.2 Random intercept models 

We consider two approaches to modeling discrete survival times, discrete time
hazard models and the proportional odds model for ordinal data (with
censoring).

Discrete time hazards models can be estimated by expanding the data as
outlined in Display 2.2 on page 46. In the expanded dataset, a child with age
of first experimentation in the sth interval has responses $y_{ir}$, r = 1,...,s; a
child who was interviewed during the sth interval but whose age of onset was
not before his last birthday has responses $y_{ir}$, r = 1,...,s-1. The responses
are equal to 1 if experimentation occurred in the corresponding interval and
zero otherwise.
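
The data expansion can be sketched as below. This is an illustration only: the
function name `expand_child` and the (interval, response) output format are
assumptions of ours and do not reproduce Display 2.2.

```python
def expand_child(s, delta):
    """Expand one child's discrete duration into binary responses.

    s     : interval of first experimentation (delta = 1), or the interval
            in which the child was interviewed (delta = 0, censored)
    delta : 1 if experimentation was observed, 0 if censored
    Returns a list of (r, y_r) pairs, one per interval at risk.
    """
    last = s if delta == 1 else s - 1
    return [(r, 1 if (delta == 1 and r == s) else 0)
            for r in range(1, last + 1)]


# Experimentation in the 3rd interval: three responses, the last equal to 1
print(expand_child(3, 1))
# Interviewed during the 3rd interval without prior onset: two zero responses
print(expand_child(3, 0))
```

Fitting a logistic or complementary log-log regression to the stacked
responses, with one constant per interval, then gives the two discrete time
hazard models discussed in this section.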

Logistic regression for the expanded data then corresponds to a continuation
ratio model whereas complementary log-log regression corresponds to
a proportional hazards model (see Section 2.5.2). In addition to the usual
covariates, the models include a constant $\kappa_s$, s = 1,...,5, for each possible
interval of first experimentation.

To model unobserved heterogeneity between schools, we can include a random
intercept in the linear predictor for either of these models. The linear
predictor for the rth response for the ith pupil in the jth school is therefore

$$\nu_{ijr} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \kappa_r + \zeta_j, \qquad (12.1)$$

where $\zeta_j \sim N(0, \psi)$.

We also consider a censored proportional odds model with a random
intercept. Here the response probabilities are as shown in Table 12.3 with
cumulative probabilities

$$P_s = \Pr(T_{ij} < \tau_s) = \Pr(y_{ij} \le s) = g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \kappa_s),$$

where g is the logit link. Note that the sign of the regression coefficients
$\boldsymbol{\beta}$ is reversed compared with the usual proportional odds model so that
positive coefficients imply an increased odds of low responses versus high
responses. For noncensored durations, the response probabilities have the form

$$\Pr(\tau_{s-1} \le T_{ij} < \tau_s) = g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \kappa_s) - g^{-1}(\mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j + \kappa_{s-1}).$$

The model can therefore be estimated using the composite link function
discussed in Section 2.3.5. Alternatively, the model can be estimated by treating
the responses of children of different current ages as distinct ordinal responses
with different numbers of categories (see Table 12.3(a)). For example, those
who were aged 11 when surveyed have an ordinal response with two possible
categories and those who were aged 12 have a response with 3 categories, etc.
The thresholds $\kappa_s$ are then constrained equal across responses.
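
The category probabilities in Table 12.3(a) can be computed from the
cumulative probabilities as in the following sketch. The function names and
the threshold values are hypothetical and chosen only for illustration; they
are not estimates from the smoking data.

```python
import math


def inv_logit(x):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-x))


def response_probabilities(current_age, xb, zeta, kappa):
    """Category probabilities for a child of a given current age (11 to 15).

    xb    : value of the fixed part x'beta
    zeta  : school-level random intercept
    kappa : thresholds kappa_1, ..., kappa_5 (increasing)
    Returns [P_1, P_2 - P_1, ..., 1 - P_m] with m = current_age - 10.
    """
    m = current_age - 10  # number of fully observed intervals
    cumulative = [inv_logit(xb + zeta + kappa[s]) for s in range(m)]
    probs = [cumulative[0]]
    probs += [cumulative[s] - cumulative[s - 1] for s in range(1, m)]
    probs.append(1.0 - cumulative[-1])  # censored category
    return probs


# Hypothetical thresholds, for illustration only
kappa = [-2.0, -1.5, -1.0, -0.5, 0.0]
print(response_probabilities(11, xb=0.0, zeta=0.0, kappa=kappa))  # 2 categories
print(response_probabilities(12, xb=0.0, zeta=0.0, kappa=kappa))  # 3 categories
```

Because the same thresholds are used for every current age, the sketch
mirrors the constraint that the thresholds are equal across the distinct
ordinal responses.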

Parameter estimates are given in the first three columns of Table 12.4.
All three models lead to essentially the same conclusions. Females are less
at risk than males if no parent is smoking at home. A parent smoking at
home increases the risk of smoking for boys and this effect is even greater
for girls. There is an effect of cohort for girls with the risk of having a first
cigarette earlier increasing over time. There does not appear to be a cohort
effect for boys. Although the estimated $\psi$ are quite small, there is significant
heterogeneity between schools in the ages of onset. (For both continuation
ratio and proportional hazards models the likelihood ratio tests give p < 0.001
using half the p-value based on the $\chi^2(1)$ distribution; see Section 8.3.4.)
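
The halved p-value for a likelihood ratio test of a zero variance can be
computed as in the sketch below. It uses the identity
$\Pr(\chi^2(1) > x) = \mathrm{erfc}(\sqrt{x/2})$; the function name and the
log-likelihood values in the usage line are hypothetical and are not taken
from Table 12.4.

```python
import math


def halved_lr_pvalue(loglik_null, loglik_alt):
    """Likelihood ratio test of a variance being zero, with halved p-value.

    loglik_null : maximized log-likelihood without the random effect
    loglik_alt  : maximized log-likelihood with the random effect
    """
    lr = 2.0 * (loglik_alt - loglik_null)
    if lr <= 0.0:
        # variance estimated as zero: no evidence against the null
        return 1.0
    p_chi2_1 = math.erfc(math.sqrt(lr / 2.0))  # upper tail of chi-square(1)
    return 0.5 * p_chi2_1


# Hypothetical log-likelihoods, for illustration only
print(halved_lr_pvalue(-1000.0, -995.0))
```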
Table 12.4 Estimates and standard errors for random intercept discrete time duration models

                           Cont. ratio   Prop. hazards   Prop. odds   Curr. status   Telescoping
Parameters                 Est (SE)      Est (SE)        Est (SE)     Est (SE)       Est (SE)

Fixed Effects:
[Girl]                     -0.34
[Parsmoke]                  0.31
[Girl] x [Parsmoke]         0.29
[Boy] x [Cohort2]           0.02
[Girl] x [Cohort2]          0.08

Telescoping:
[Boy]
[Girl]

Thresholds or constants:
$\kappa_1$                 -2.17
$\kappa_2$                 -2.41
$\kappa_3$                 -1.91
$\kappa_4$                 -1.38
$\kappa_5$                 -1.23

Random Effect:
$\psi$                      0.07 (0.02)

Log-likelihood             -6225.5

[The remaining entries of this table are not recoverable from the source.]

Source: Rabe-Hesketh et al. (2001d)
12.3.3 Modeling ‘telescoping’ effects

One problem with the three models considered above is that they assume 
that the recalled ages of onset are reliable. Accounting for measurement error 
in age-of-onset data has received rather little attention in the literature. We 
consider two alternative approaches. 

The first approach is to discard the timing element of the children’s
responses and simply model their current status (ever experimented) as a
function of their current age, using a simple logistic regression model with the
current smoking status indicator as the response variable, as indicated in
Table 12.3(b). This gives the results in column 4 of Table 12.4 which are not
very different from those for the proportional odds model.

Another approach is to model recall bias directly. It has been suggested that
recall errors are characterized by an apparent shifting of events from the more
distant past towards the time at which data collection is made (Sudman and
Bradburn, 1973; Hobcraft and Murphy, 1987). This ‘telescoping’ could arise
from an internal compression of the time scale so that an event that occurred
a time t ago is reported as occurring a time $\gamma t$ ago with $0 < \gamma < 1$.

Telescoping could also result from heteroscedastic measurement error where 
the error variance increases with the lag between the event and the time 
of recollection even when the errors are symmetrically distributed. This is 
because more events from the distant past, that are typically recalled with 
larger errors, are shifted into the recent past than events in the recent past, 
that are typically recalled with smaller error, are shifted back into the distant 
past (Pickles et al., 1996). While Pickles et al. develop models to distinguish
between these processes, here we will only consider systematic telescoping. 

In the proportional-odds model, we assume that the log odds that the
recalled age of onset is before a given age $\tau_s$ decreases linearly with the time
that has passed since that age, $a_{ij} - \tau_s$, where $a_{ij}$ is the child’s current age:

$$\ln\left[\frac{\Pr(T_{ij} < \tau_s)}{1 - \Pr(T_{ij} < \tau_s)}\right] = \mathbf{x}_{ij}'\boldsymbol{\beta} + \zeta_j - (a_{ij} - \tau_s)\mathbf{w}_{ij}'\boldsymbol{\alpha} + \kappa_s.$$

Here, $\boldsymbol{\alpha}$ is a vector of coefficients and $\mathbf{w}_{ij}$ is a vector of explanatory
variables that may predict the degree of telescoping (positive coefficients
imply a compression of the time scale). If the proportional odds model is
interpreted as a latent response model, telescoping corresponds to allowing the
thresholds to depend on the time-lag, i.e., the thresholds are

$$-(a_{ij} - \tau_s)\mathbf{w}_{ij}'\boldsymbol{\alpha} + \kappa_s.$$

We assume that the degree of telescoping depends on sex only, giving the
parameter estimates in the last column of Table 12.4. While there is little
evidence of telescoping for girls, the boys tend to stretch the time scale (rather
than compress it), perhaps ‘showing off’ with having experimented earlier than
they actually did.

Note that separate identification of telescoping and cohort effects is possible
here because some of the cohorts of children are represented in both surveys
at different ‘current’ ages and therefore different time lags.
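
The effect of the telescoping term on the recalled cumulative probability can
be illustrated with a short sketch. All parameter values below are
hypothetical, and the function name is ours; the code only evaluates the log
odds expression given above.

```python
import math


def recalled_cumulative_prob(xb, zeta, kappa_s, current_age, tau_s, w_alpha):
    """Probability that the recalled age of onset is before tau_s.

    w_alpha is the scalar value of w'alpha; a positive value compresses the
    time scale and a negative value stretches it.
    """
    lag = current_age - tau_s  # time passed since age tau_s
    log_odds = xb + zeta - lag * w_alpha + kappa_s
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values: threshold for onset before age 11, child aged 11 or 15
for age in (11, 15):
    no_tel = recalled_cumulative_prob(0.0, 0.0, -2.0, age, 11, w_alpha=0.0)
    compress = recalled_cumulative_prob(0.0, 0.0, -2.0, age, 11, w_alpha=0.1)
    print(age, round(no_tel, 3), round(compress, 3))
```

With a positive coefficient the probability of recalling an early onset falls
as the time lag grows, which is the compression described in the text; with a
zero lag the two probabilities coincide.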


12.4 Exercise and angina: Proportional hazards random effects 
and factor models 

12.4.1 Introduction

We analyze the dataset¹ published in Danahy et al. (1976) and previously
considered by Pickles and Crouchley (1994, 1995). Subjects with coronary
heart disease participated in a randomized crossover trial comparing the effect
of the drug isosorbide dinitrate (ISDN) with placebo.

Before receiving the drug (or placebo), subjects were asked to exercise on
exercise bikes to the onset of angina pectoris or, if angina did not occur,
to exhaustion. The exercise time and outcome (angina or exhaustion) were
recorded. The drug (or placebo) was then taken orally and the exercise test
repeated one, three and five hours after drug (or placebo) administration.
We therefore have repeated ‘survival’ times per subject pre and post
administration of both an active drug and a placebo.

Each subject repeated the exercise test 4 times in the placebo condition and
4 times in the drug condition. The responses of interest are the durations to
angina or exhaustion in the drug condition $y_{ij}$ for exercise test i and subject
j; $\delta_{ij}$ is a censoring indicator taking the value 1 if angina occurred and 0
otherwise. The duration to angina $t_{ij}$ in the corresponding placebo condition
will be treated as a time-varying covariate (there were no censorings under
the placebo). Unfortunately, the order in which placebo and drug were given
was not reported in Danahy et al. (1976).

Since the subjects started each of the eight exercise tests at rest, so that
the same processes leading to angina or exhaustion can be assumed to begin
at the start of each exercise test, we will assume that the hazard functions for
the tests are proportional. According to the classification of aspects involved
in constructing risk sets in Section 12.2 we specify:

• Risk interval formulation: gap time with the time scale as starting at 0
at the beginning of each exercise test

• Baseline hazard: common baseline hazard for each exercise test

• Risk set for kth event: unrestricted because each event occurs in a
separate exercise test.

This specification corresponds to assuming proportional hazards between tests
and hence between pre and post drug administration. This is required to
specify the treatment effect as a hazard ratio.

The following covariates were considered:

• [Bypass] a dummy variable for previous coronary artery bypass graft surgery

$$x_{Bj} = \begin{cases} 1 & \text{if previous bypass surgery} \\ 0 & \text{otherwise} \end{cases}$$

• [TimeP] standardized time to angina in the placebo condition, $t_{ij}$

• [After] a dummy variable for drug administered, i.e.

$$x_{Tij} = \begin{cases} 0 & \text{if } i = 1 \\ 1 & \text{if } i = 2, 3, 4 \end{cases}$$

• [Lin] linear trend of drug effect

$$x_{Dij} = \begin{cases} 0 & \text{if } i = 1 \\ i - 3 & \text{if } i = 2, 3, 4 \end{cases}$$

¹ The data can be downloaded from gllamm.org/books

The definitions of the covariates that are varying over exercise tests [TimeP], 
[After], and [Lin] are summarized in Table 12.5. 


Table 12.5 Definitions of covariates varying over exercise tests i

            [TimeP]            [After]        [Lin]
Test i      Placebo $t_{ij}$   $x_{Tij}$      $x_{Dij}$

1           $t_{1j}$           0               0
2           $t_{2j}$           1              -1
3           $t_{3j}$           1               0
4           $t_{4j}$           1               1
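
The coding of [After] and [Lin] implies that the log hazard ratio of test i
relative to the pre-treatment test is the coefficient of [After] plus (i - 3)
times the coefficient of [Lin]. The sketch below illustrates this; the function
names are ours, and the coefficient values used in the usage lines are the
[After] and [Lin] estimates -1.02 and 0.62 that appear in Table 12.6.

```python
def covariates_for_test(i):
    """Return (x_T, x_D), the [After] and [Lin] covariates for test i."""
    if i not in (1, 2, 3, 4):
        raise ValueError("exercise test index must be 1, 2, 3 or 4")
    after = 0 if i == 1 else 1
    lin = 0 if i == 1 else i - 3
    return after, lin


def log_hazard_ratio(i, beta_after, beta_lin):
    """Log hazard ratio of test i versus the pre-treatment test i = 1."""
    after, lin = covariates_for_test(i)
    return beta_after * after + beta_lin * lin


for i in (1, 2, 3, 4):
    print(i, covariates_for_test(i), round(log_hazard_ratio(i, -1.02, 0.62), 2))
```

Centering [Lin] at the mid post-treatment test is what makes the [After]
coefficient interpretable as the treatment effect at test 3.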


12.4.2 Cox regression

The Cox proportional hazards model can be written as (see also equation (2.22))

$$h_{ij}(t) = h^0(t)\exp(\nu_{ij}),$$

where $h^0(t)$ is the baseline hazard, the hazard when $\nu_{ij} = 0$. The linear
predictor is

$$\nu_{ij} = \beta_1 x_{Bj} + \beta_2 t_{ij} + \beta_3 x_{Tij} + \beta_4 x_{Dij}, \qquad (12.2)$$

where $\beta_1$ represents the effect of [Bypass], $\beta_2$ the effect of time to angina
under placebo [TimeP], $\beta_3$ the difference in treatment effect between the mid
post-treatment test i = 3 and the baseline i = 1 [After] and $\beta_4$ the linear
change in treatment effect over the three post-treatment tests [Lin].

It was pointed out in Section 2.5.1 that including a dummy variable for each
risk set in Poisson regression and using the log interval length between failure
times as an offset leads to a likelihood proportional to the partial likelihood. A
problem with this approach is that there will often be a very large number of
risk sets and therefore an excessive number of parameters to be estimated. For
instance, as many as 64 dummies are required for the present small dataset.
This suggests approximating the baseline hazard with a parsimonious smooth
function of time. Specifically, we consider an orthogonal polynomial of degree 6,

$$\ln h^0(t) = \alpha_0 + \alpha_1 f_1(t) + \alpha_2 f_2(t) + \alpha_3 f_3(t) + \alpha_4 f_4(t) + \alpha_5 f_5(t) + \alpha_6 f_6(t), \qquad (12.3)$$

where $f_k(t)$ is the kth order term. Note that although the baseline hazard is
modeled as a smooth function of time, the piecewise exponential formulation
implies that the hazard is constant between successive events giving steps
whose heights are determined by the smooth function.

Estimates and standard errors for the effects of [Bypass], [TimeP], [After]
and [Lin] are reported in Table 12.6 for partial likelihood (Cox regression),
maximum likelihood Poisson regression with dummies for the risk sets and
maximum likelihood Poisson regression with a 6th order orthogonal polynomial
of time (orthpol-6). For each implementation, we report the estimated
$\beta$’s, standard errors (SE) and hazard ratios $\exp(\beta)$ (denoted HR). Note that
the estimates and standard errors are identical under partial likelihood and
maximum likelihood Poisson regression with dummies as prescribed by theory.
Also note that the estimates from Cox regression are well recovered by
modeling the log baseline hazard as a 6th order orthogonal polynomial of
time. This approach is much more parsimonious than Poisson regression with
a dummy variable for each risk set, which requires 57 additional parameters
in the present example. For larger datasets, the number of additional
parameters will simply be unmanageable. We will therefore rely on the orthogonal
polynomial approximation to the log baseline hazard $\ln[h^0(t)]$ for the duration
models with latent variables considered subsequently.

Table 12.6 Estimates for different implementations of Cox regression

                       Partial                ML Poisson             ML Poisson
                       likelihood             dummies                orthpol-6
Parameter              Est (SE)       HR      Est (SE)       HR      Est (SE)   HR

$\beta_1$ [Bypass]      0.88 (0.29)   2.42     0.88 (0.29)   2.42
$\beta_2$ [TimeP]      -1.17 (0.20)   0.31    -1.17 (0.20)   0.31
$\beta_3$ [After]      -1.02 (0.29)   0.36    -1.02 (0.29)   0.36
$\beta_4$ [Lin]         0.62 (0.19)   1.85     0.62 (0.19)   1.85

Log-likelihood         -224.76

[Entries for the orthpol-6 column and the Poisson log-likelihoods are not
recoverable from the source.]

Figure 12.2 shows the estimated constants and their 95% confidence intervals
for the Poisson model with dummies for the risk sets. The smooth curve
is the estimated log baseline hazard for the Poisson model with a sixth order
polynomial of time. The polynomial does not seem to oversmooth the log
baseline hazard since the curve falls well within the confidence intervals for
the constants.





[Figure 12.2: log baseline hazard plotted against time in seconds]

Figure 12.2 Log baseline hazard for angina data. Points represent constants
estimated for each risk set with 95% confidence intervals shown as dotted lines.
Curve represents the corresponding sixth order polynomial
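
The Poisson formulation of Cox regression described in this section requires
one record per unit and risk set. A minimal sketch of this data construction
is given below, assuming that a unit belongs to the risk set at a failure time
if its own duration is at least that long. The function name, the record
layout and the toy durations are hypothetical and are not the angina data.

```python
import math


def poisson_records(durations):
    """Build records for Poisson regression with one dummy per risk set.

    durations : list of (time, event) pairs, event = 1 for failure and
                0 for censoring
    Returns a list of (unit, risk_set, y, offset) tuples, where offset is
    the log of the interval length between successive failure times.
    """
    failure_times = sorted({t for t, d in durations if d == 1})
    records = []
    previous = 0.0
    for k, cut in enumerate(failure_times, start=1):
        offset = math.log(cut - previous)
        for unit, (time, event) in enumerate(durations):
            if time >= cut:  # unit is still at risk at this failure time
                y = 1 if (event == 1 and time == cut) else 0
                records.append((unit, k, y, offset))
        previous = cut
    return records


# Toy data, for illustration only: failures at 2 and 5, censoring at 3
for record in poisson_records([(2.0, 1), (3.0, 0), (5.0, 1)]):
    print(record)
```

Replacing the risk set dummies by polynomial terms in the failure time keeps
the same records but reduces the number of baseline parameters.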


12.4.3 Proportional hazards regression with multinormal latent variables

We first consider the conventional frailty model, a simple random intercept
model with linear predictor

$$\nu_{ij} = \beta_1 x_{Bj} + \beta_2 t_{ij} + \beta_3 x_{Tij} + \beta_4 x_{Dij} + \zeta_{1j},$$

where $\zeta_{1j} \sim N(0, \psi_{11})$. The exponential of the random intercept, $\exp(\zeta_{1j})$,
is often called the frailty, and we are hence specifying a log-normal frailty
(e.g. McGilchrist and Aisbett, 1991). Alternative distributions for the frailty
include the one-parameter gamma (e.g. Clayton, 1978), the positive stable
(e.g. Hougaard, 1986b) and the inverse Gaussian (e.g. Hougaard, 1986a).

Estimates and standard errors for the random intercept model are given in
the second column of Table 12.7. The estimates for the fixed part are also
given in exponentiated form, i.e. as hazard ratios HR.

To allow for a less restrictive dependence structure, we depart from the
conventional frailty model and consider a factor model

$$\nu_{ij} = \beta_1 x_{Bj} + \beta_2 t_{ij} + \beta_3 x_{Tij} + \beta_4 x_{Dij} + \lambda_i \zeta_{1j}. \qquad (12.4)$$

Here, $\zeta_{1j} \sim N(0, \psi_{11})$ and we have identified the model by restricting one
of the factor loadings, $\lambda_2 = 1$. Note that the frailty model is obtained if the
restrictions $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$ are imposed in the factor model. Estimates
and standard errors for the factor model are given in the third column of
Table 12.7. We prefer this model to the random intercept model since the
log-likelihood increases quite considerably.

Table 12.7 Estimates for random intercept (frailty) and factor structured
proportional hazards models

                        Frailty              Factor
Parameter               Est (SE)   HR        Est (SE)   HR

Fixed part:
$\beta_1$ [Bypass]
$\beta_2$ [TimeP]
$\beta_3$ [After]
$\beta_4$ [Lin]
Random part:
$\psi_{11}$
$\lambda_1$
$\lambda_3$
$\lambda_4$

Log-likelihood

[Table entries are not recoverable from the source.]

Another candidate is a random treatment model where [After] has an
associated random coefficient $\zeta_{2j}$ but there is no random intercept,

$$\nu_{ij} = \beta_1 x_{Bj} + \beta_2 t_{ij} + (\beta_3 + \zeta_{2j}) x_{Tij} + \beta_4 x_{Dij}, \qquad (12.5)$$

where $\zeta_{2j} \sim N(0, \psi_{22})$. Estimates and standard errors for the random
treatment model are given in the second column of Table 12.8. Note that this
model is a special case of the factor model with $\lambda_1 = 0$ and $\lambda_2 = \lambda_3 = \lambda_4 = 1$. The
decrease in the log-likelihood from imposing these restrictions is moderate,
which could have been surmised from inspecting the estimated factor loadings.
Figure 12.3 shows the log-hazards for the third exercise test for hypothetical
individuals with [TimeP] equal to the mean (240.5) and log frailties of -2,
-1, 0, 1 and 2 standard deviations.

Finally, we consider a random intercept and treatment model. This model
has two random effects, a random intercept $\zeta_{1j}$ correlated with a random
treatment effect $\zeta_{2j}$,

$$\nu_{ij} = \zeta_{1j} + \beta_1 x_{Bj} + \beta_2 t_{ij} + (\beta_3 + \zeta_{2j}) x_{Tij} + \beta_4 x_{Dij},$$

where $(\zeta_{1j}, \zeta_{2j})$ are assumed to have a bivariate normal distribution, with
variances $\psi_{11}$ and $\psi_{22}$ and covariance $\psi_{21}$. The multivariate normal distribution
is a convenient choice for models with several latent variables since it is not
obvious how to generalize the other (univariate) distributions often used for
frailty models. Estimates and standard errors for this model are given in the
third column of Table 12.8. The estimated correlation between the random
effects is virtually 1, and this boundary solution might indicate that one
random effect would suffice for this application. This is also reflected in the very
small increase in log-likelihood compared to the random treatment model. We
therefore abandon the random intercept and treatment model.

Table 12.8 Estimates for random coefficient proportional hazards models

                        Random               Random
                        treatment            intercept & treatm.
Parameter               Est (SE)   HR        Est (SE)   HR

Fixed part:
$\beta_1$ [Bypass]
$\beta_2$ [TimeP]
$\beta_3$ [After]
$\beta_4$ [Lin]
Random part:
$\psi_{11}$
$\psi_{22}$
$\psi_{21}$

Log-likelihood

† boundary solution

[Table entries are not recoverable from the source.]

The estimated hazard ratios HR, reported in the fixed part in Tables 12.7
and 12.8, are conditional on the latent variables. We observe that the estimates
of the hazard ratio for [Bypass] are larger than 1 for all models apart from the
factor model. However, the estimates are quite imprecise and vary considerably
across models, from 2.96 to 0.96. As would be expected, longer durations under
placebo seem to reduce the hazard under the drug condition. The estimates
of the hazard ratio for [After] suggest that the drug reduces the hazard at the
mid post-treatment exercise test compared with baseline. We also note that
the log hazard increases over the post-treatment tests, since $\beta_4$ is estimated
as positive.

Regarding the random part, we note that a likelihood ratio test for the
random intercept variance in the random intercept model strongly indicates
that it is needed, which contradicts the corresponding Wald test. We
recommend using the likelihood-ratio test with halved p-value for testing the null
hypothesis that the variance of a latent variable is zero; see Section 8.3.4.
[Figure 12.3: log hazard plotted against time in seconds]

Figure 12.3 Log hazards for exercise test 3 for hypothetical individuals with log
frailties $\zeta_j$ equal to -2, -1, 0, 1 and 2 estimated standard deviations $\hat{\psi}^{1/2}$
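
The relation between the factor model, the frailty model and the random
treatment model can be made concrete with a small sketch in which only the
factor loadings differ. The function name and all numerical values are
hypothetical; only the loading patterns follow the restrictions stated in this
section.

```python
def linear_predictor(fixed_part, loadings, zeta, i):
    """Linear predictor for exercise test i with a single latent variable.

    fixed_part : value of the fixed part of the linear predictor for test i
    loadings   : factor loadings lambda_1, ..., lambda_4
    zeta       : realized value of the latent variable for the subject
    """
    return fixed_part + loadings[i - 1] * zeta


FRAILTY = [1.0, 1.0, 1.0, 1.0]            # conventional random intercept model
RANDOM_TREATMENT = [0.0, 1.0, 1.0, 1.0]   # latent variable acts only after drug

# Hypothetical subject with zeta = 0.5 and fixed part equal to zero
for i in (1, 2, 3, 4):
    print(i,
          linear_predictor(0.0, FRAILTY, 0.5, i),
          linear_predictor(0.0, RANDOM_TREATMENT, 0.5, i))
```

In the random treatment pattern the latent variable leaves the pre-treatment
test unaffected, which is why that model has no random intercept.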


12.4-4 Latent class proportional hazards models 

Sometimes random effects are assumed to be truly discrete, leading to latent
class models. Based on the results reported for continuous random effects, we
will concentrate on the random treatment and factor models in this section.
The models take the same form as in the continuous case, with the vital
difference that the random effects $\zeta_j$ are now discrete with locations

    $\zeta_j = e_c, \quad c = 1, \ldots, C,$

and probabilities $\pi_c$ instead of normally distributed. We also confine the
discussion to two-class models, $C = 2$.

Estimates for the two-class random treatment model in (12.5) are reported
in the second column of Table 12.9. The location -2.59 of the first class for
the random treatment model suggests that there is a class of patients who
respond remarkably well to the drug. Moreover, this class comprises 21% of
the population.

Regarding the fixed part estimates, the results for both latent class models 
are quite similar to the estimates for the normal random treatment and factor 
models reported in Table 12.7. The loadings in the factor model have the same 
pattern as for the model assuming normality. 



Parameter rows of Table 12.9:
  Fixed part:   $\beta_1$ [Bypass], $\beta_2$ [TimeP], $\beta_3$ [After], $\beta_4$ [Lin]
  Random part:  Var($\zeta_j$), $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$
  Locations:    $e_1$, $e_2$
  Class prob.:  Pr($e_1$), Pr($e_2$)
  Log-likelihood


12.4.5 Nonparametric maximum likelihood estimation

Assuming parametric continuous distributions for the random effects, such as 
the multivariate normal used above, has caused consternation among some 
researchers. Admittedly, it is problematic if results are highly sensitive to the 
choice of parametric distributions whose main motivation is convenience. An 
alternative is to use models with latent classes as described above, but it often 
seems unreasonable to assume that the population consists of truly discrete 
classes. This has led to the development of duration models where the latent 
variable distribution may be continuous and/or discrete, not requiring any 
prior specification. Both locations and mass-points for the latent variables are 
then estimated, and nonparametric maximum likelihood estimates (NPMLE) 
are obtained when no further improvement in the likelihood can be achieved 
by introducing additional mass-points (e.g. Heckman and Singer, 1982, 1984); 
see Section 6.5. 

NPML estimates for the random treatment model are reported in the second
column of Table 12.10. The location and mass estimates of the obtained
three-mass solution are (-2.91, 0.32, 2.84) and (0.21, 0.66, 0.14), respectively.
NPML estimates for the factor model in (12.4) are given in the third column
of Table 12.10. A five-mass solution was obtained with locations estimated as
(-18.10, -5.51, 0.74, 3.88, 6.91) and corresponding masses as (0.10, 0.10, 0.28,
0.44, 0.08). Note that the first class has a frailty approaching zero,
indicating a boundary solution. This is reflected in the large estimated
Var($\zeta_j$). The estimates are close to those assuming a normally distributed
factor, notwithstanding that the nonparametric distribution of the factor
appears to be quite skewed. Thus, the factor model appears to be robust against
misspecification of the factor distribution in this application.


Table 12.9 Estimates for 2-class random treatment and factor models

                 Random treatm.         Factor
Parameter        Est (SE)     HR        Est (SE)     HR


Parameter rows of Table 12.10:
  Fixed part:   $\beta_1$ [Bypass], $\beta_2$ [TimeP], $\beta_3$ [After], $\beta_4$ [Lin]
  Random part:  Var($\zeta_j$), $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$
  Log-likelihood
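
To see why a mass point far out in the tail inflates the variance, the mean and variance of a discrete mass-point distribution can be computed directly from the locations and masses quoted above. The sketch below is ours, not the authors' estimation code; it renormalizes the masses because the rounded values need not sum to exactly one, and it presumes that the reported Var($\zeta_j$) is the variance of this discrete distribution.

```python
def discrete_moments(locations, masses):
    """Mean and variance of a discrete distribution given by mass points."""
    total = sum(masses)  # rounded masses need not sum to exactly 1
    probs = [m / total for m in masses]
    mean = sum(p * e for p, e in zip(probs, locations))
    var = sum(p * (e - mean) ** 2 for p, e in zip(probs, locations))
    return mean, var


# Three-mass solution for the random treatment model (values from the text)
mean3, var3 = discrete_moments([-2.91, 0.32, 2.84], [0.21, 0.66, 0.14])

# Five-mass solution for the factor model (values from the text)
mean5, var5 = discrete_moments(
    [-18.10, -5.51, 0.74, 3.88, 6.91],
    [0.10, 0.10, 0.28, 0.44, 0.08],
)

print(f"3 masses: mean {mean3:.2f}, variance {var3:.2f}")
print(f"5 masses: mean {mean5:.2f}, variance {var5:.2f}")
```

The location at -18.10 alone contributes roughly 0.10 x 18.10^2, about 33, to the second moment of the five-mass solution, which is why that variance is more than ten times the three-mass one.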

12.5 Summary and further reading 

We initially discussed special issues arising in constructing risk sets for
multiple events clustered duration data, pointing out that appropriate modeling of
this case is more complex than modeling of single events clustered duration
data.

The first application used discrete time frailty models for the onset of
smoking among adolescents. Approaches to handling recall bias for age of onset data
were discussed and a model with ‘systematic telescoping’ was estimated.

We then considered proportional hazards models for onset of angina pectoris
in a randomized clinical trial. Different kinds of latent variables were
introduced to model the dependence among repeated durations within each
patient. Models with continuous latent variables such as factor models and
various random effects models as well as discrete latent variables were
estimated. We also left the latent variable distribution unspecified by using
nonparametric maximum likelihood.


Table 12.10 NPMLE for random treatment and factor models

                 Random treat.          Factor
                 3 masses               5 masses
Parameter        Est (SE)     HR        Est (SE)     HR

Hougaard (2000) provides an extensive treatment of multivariate frailty
models, concentrating on models with positive stable frailty distributions.
Vermunt (1997) gives an easily accessible discussion of survival models with
discrete latent variables. Useful papers on survival or duration modeling with
frailty and latent variables include Clayton (1978), Vaupel et al. (1979),
Clayton and Cuzick (1985), Aalen (1988), Pickles and Crouchley (1994, 1995) and
Vaida and Xu (2000).

There has recently been considerable interest in joint modeling of duration
and other responses; see Hogan and Laird (1997ab) for an overview. We
consider a joint model for survival of liver cirrhosis and a marker process in
Section 14.6, estimating both direct and indirect effects of treatment.


CHAPTER 13


Comparative responses: Polytomous 
responses and rankings 


13.1 Introduction 

Polytomous responses (also known as ‘first choices’ or ‘discrete choices’),
pairwise comparisons and rankings were introduced in Section 2.4.3. A
distinguishing feature of the latent response formulation for these processes is that
the response is a vector of utilities.

It is unrealistic to assume that the utilities for different alternatives are
uncorrelated. For the multinomial logit model this would imply the unrealistic
property known as ‘independence from irrelevant alternatives’ to be discussed
in Section 13.2. To model the dependence, it is tempting to simply introduce
factors and/or random coefficients in the same way as in multivariate
regression models. However, in contrast to the case of continuous responses,
we do not observe the multivariate vector of utilities directly because the
response processes are comparative. Identification thus becomes more difficult
as discussed in Bunch (1991), Keane (1992) and Skrondal and Rabe-Hesketh
(2003b). In Section 13.3 we describe how the fixed and random parts
of the model are usually structured.

We consider multinomial logit instead of probit models with latent variables
in this chapter. The reason is pragmatic; the models are very similar but the
logit versions are more convenient computationally. Specifically, the
conditional probability of a response given the latent variables can be expressed
in closed form for multinomial logit models in contrast to probit models. As
was pointed out in Section 2.4.3, obtaining response probabilities in
multinomial probit models requires integration even when the model is void of latent
variables.

The applications considered in this chapter are from political science and 
marketing. We discuss models for rankings as well as discrete choices, both 
with continuous and discrete latent variables. 

13.2 Heterogeneity and ‘Independence from Irrelevant
Alternatives’

It follows from the logistic discrete choice model (2.18) presented on page 37
that the odds for two alternatives a and b for unit i are

    $\frac{\Pr(a)}{\Pr(b)} = \exp(\nu_i^a - \nu_i^b),$    (13.1)

which only depend on the linear predictors for the two alternatives. Hence, the
odds do not depend on which other alternatives there are in the alternative
set. In the ranking case it furthermore follows that the odds do not depend
on which other alternatives have already been chosen, the number of
alternatives already chosen or the order in which those alternatives were chosen.
Luce (1959) denoted this property ‘Independence from Irrelevant Alternatives’
(IIA).
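
The invariance of the odds can be checked numerically. The following is a minimal sketch with hypothetical linear predictors for three alternatives labeled a, b and c (the values and the function name are illustrative only): removing alternative c changes both probabilities but leaves their ratio at exp(nu_a - nu_b).

```python
import math


def logit_probs(nu):
    """Multinomial logit choice probabilities from a dict of linear predictors."""
    denom = sum(math.exp(v) for v in nu.values())
    return {s: math.exp(v) / denom for s, v in nu.items()}


nu = {"a": 0.4, "b": -0.3, "c": 1.1}  # hypothetical linear predictors

full = logit_probs(nu)                                   # alternatives a, b, c
reduced = logit_probs({s: nu[s] for s in ("a", "b")})    # alternative c removed

odds_full = full["a"] / full["b"]
odds_reduced = reduced["a"] / reduced["b"]

print(f"odds a:b with c available:   {odds_full:.4f}")
print(f"odds a:b with c removed:     {odds_reduced:.4f}")
print(f"exp(nu_a - nu_b):            {math.exp(nu['a'] - nu['b']):.4f}")
```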

The problem associated with the IIA property can be illustrated by adapting
the red bus - blue bus example of McFadden (1973). Let there be three
political parties, Labour I, Labour II and Conservatives. The first two parties
are indistinguishable and have the same linear predictor $\nu_i^{\mathrm{lab}}$ whereas the
Conservative party has linear predictor $\nu_i^{\mathrm{con}}$. The probability of voting for
either Labour I or Labour II is

    $\frac{2\exp(\nu_i^{\mathrm{lab}})}{2\exp(\nu_i^{\mathrm{lab}}) + \exp(\nu_i^{\mathrm{con}})}.$

Considering the scenario that the two Labour parties merge to form a single
Labour party, the probability of voting Labour decreases to

    $\frac{\exp(\nu_i^{\mathrm{lab}})}{\exp(\nu_i^{\mathrm{lab}}) + \exp(\nu_i^{\mathrm{con}})},$

which is clearly counterintuitive. The model is thus often deemed unduly
restrictive (e.g. Takane, 1987 and the references therein).

The consequences of IIA are most pronounced for an indifferent voter with
$\nu_i^{\mathrm{lab}} - \nu_i^{\mathrm{con}} = 0$. The merger would in this case reduce the probability of
voting Labour from 0.67 to 0.50, consistent with an equiprobable choice among
initially three and then two available parties. However, in reality almost all
voters will have a party preference and there will be heterogeneity in party
preference among voters. This heterogeneity, observed and unobserved,
ensures that the model does not imply a substantial reduction in the share of
the Labour vote due to the merger.

To illustrate the effect of observed heterogeneity, consider a fixed effect of
gender giving $\nu_i^{\mathrm{lab}} - \nu_i^{\mathrm{con}} = -1.2$ for men and $\nu_i^{\mathrm{lab}} - \nu_i^{\mathrm{con}} = 2.8$ for
women. For a population with 50% men, the merger reduces the marginal
probability for Labour from 0.67 to 0.59 and, marginally to the observed covariates,
IIA is therefore violated. In practice, observed covariates cannot explain all
variability in individual party preferences. The remaining unobserved
heterogeneity can be modeled by including a shared random effect for the two Labour
parties in the linear predictor. For example, if $\nu_i^{\mathrm{lab}} - \nu_i^{\mathrm{con}} = -0.8 + \zeta_i$ for
men and $\nu_i^{\mathrm{lab}} - \nu_i^{\mathrm{con}} = 3.2 + \zeta_i$ for women, where $\zeta_i$ is a normally
distributed random intercept with variance 16, the marginal (with respect to
$\zeta_i$) probabilities for Labour become 0.67 before the merger and 0.63 after.
Increasing the magnitude of the fixed effects or the random effects variances
decreases the difference even further.
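
The fixed-effects figures quoted above follow directly from the two probability expressions for the Labour vote. The sketch below (function name and layout are ours) reproduces the indifferent-voter case and the 50/50 gender mix; it deliberately leaves out the random-intercept variant, which additionally requires numerical integration over the random effect.

```python
import math


def p_labour(diff: float, n_labour_parties: int) -> float:
    """P(vote for any Labour party) when each of n identical Labour parties has a
    linear predictor exceeding the Conservative one by `diff`."""
    weight = n_labour_parties * math.exp(diff)
    return weight / (weight + 1.0)


# Indifferent voter: nu_lab - nu_con = 0
before = p_labour(0.0, 2)  # two Labour parties
after = p_labour(0.0, 1)   # merged Labour party

# Observed heterogeneity: 50% men (diff = -1.2) and 50% women (diff = 2.8)
diffs = {"men": -1.2, "women": 2.8}
marg_before = sum(0.5 * p_labour(d, 2) for d in diffs.values())
marg_after = sum(0.5 * p_labour(d, 1) for d in diffs.values())

print(f"indifferent voter: {before:.2f} -> {after:.2f}")
print(f"gender mix:        {marg_before:.2f} -> {marg_after:.2f}")
```

The drop shrinks from 0.17 to 0.08 once the population is a mixture of voters with clear preferences, even though IIA holds within each gender.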

A consequence of failing to include random effects would be that the
unrealistic IIA property still holds conditional on the observed covariates, e.g.
among men and among women if gender is the only observed covariate. Even
with voter-specific random effects IIA still holds for a given voter. This could
be relaxed by including voting-occasion-specific random effects if the resulting
model is identified.


13.3 Model structure 

We give a brief description of the model structure suggested by Skrondal and 
Rabe-Hesketh (2003b) and refer to this work for details. For simplicity, we 
consider two-level models with units i (level 1) nested in clusters j (level 2). 

It is useful to formulate multilevel models for nominal responses in terms
of latent utilities $u_{ij}^s$ with

    $u_{ij}^s = \nu_{ij}^s + \epsilon_{ij}^s.$

Recall from Section 2.4.3 that multinomial logit models result if the $\epsilon_{ij}^s$ are
specified as independently Gumbel distributed. The linear predictor $\nu_{ij}^s$
represents the mean utility for category or alternative $s$ and contains a fixed part
$f_{ij}^s$ and random parts $\delta_{ij}^{s(1)}$ and $\delta_{ij}^{s(2)}$ at levels 1 and 2, respectively,

    $u_{ij}^s = f_{ij}^s + \delta_{ij}^{s(1)} + \delta_{ij}^{s(2)} + \epsilon_{ij}^s.$

For polytomous responses, it is assumed that the alternative with the greatest
utility is chosen and for rankings the alternatives are ranked according to the
ordering of the utilities.
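
The latent utility formulation can be illustrated by simulation. The sketch below is illustrative only: the mean utilities are hypothetical, there are no random effects, and the function names are ours. It draws independent standard Gumbel errors, ranks the alternatives by utility, and compares the share of first choices with the multinomial logit probabilities.

```python
import math
import random


def gumbel(rng: random.Random) -> float:
    """Draw from the standard Gumbel (type I extreme value) distribution."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_ranking(nu, rng):
    """Rank alternatives by decreasing utility u^s = nu^s + eps^s."""
    utilities = {s: v + gumbel(rng) for s, v in nu.items()}
    return sorted(utilities, key=utilities.get, reverse=True)


rng = random.Random(2004)
nu = {"con": 0.5, "lab": 0.0, "lib": -0.5}  # hypothetical mean utilities
n_draws = 20000

first_choice = {s: 0 for s in nu}
for _ in range(n_draws):
    first_choice[simulate_ranking(nu, rng)[0]] += 1
share = {s: count / n_draws for s, count in first_choice.items()}

denom = sum(math.exp(v) for v in nu.values())
logit = {s: math.exp(v) / denom for s, v in nu.items()}

for s in nu:
    print(f"{s}: simulated {share[s]:.3f}, multinomial logit {logit[s]:.3f}")
```

The first element of each simulated ranking is the discrete choice; keeping the full list gives the ranking response.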

Let us now describe the structure of the fixed and random parts of the 
model. 


13.3.1 Fixed part 

The fixed part of the model is structured as

    $f_{ij}^s = m^s + \mathbf{x}_{ij}^{s\prime}\mathbf{b} + \mathbf{x}_{ij}^{\prime}\mathbf{g}^s,$    (13.2)

where $m^s$ is a constant, $\mathbf{x}_{ij}^s$ is a vector which varies over alternatives and
may also vary over units and/or clusters, whereas the vector $\mathbf{x}_{ij}$ varies over
units and/or clusters but not alternatives. The corresponding fixed coefficient
vectors are $\mathbf{b}$ and $\mathbf{g}^s$, respectively. Note that the effect $\mathbf{b}$ is assumed to be the
same on all utilities, so that this part of the model simply represents a linear
relationship between the utilities and alternative (and possibly unit and/or
cluster)-specific covariates. For instance, in the election example considered
in Section 13.4, we will assume a linear relationship between a measure of
distance on the left-right political dimension between the $s$th party and the
$i$th voter and the voter’s utilities for the party. For some alternative and unit
and/or cluster-specific variables the effect may differ between alternatives.
Such effects can be accommodated by including interactions between these
variables and dummy variables for the alternatives in $\mathbf{x}_{ij}^s$.

Discrete choice models only including fixed effects have been considered by
Train (1986) and ranking versions by Chapman and Staelin (1982) and Allison
and Christakis (1994). The conditional logit model, standard in econometrics,
arises as the special case where $\mathbf{x}_{ij}^{\prime}\mathbf{g}^s$ and $m^s$ are omitted in (13.2). The model
is used for discrete choices (e.g. McFadden, 1973), rankings (e.g. Hausman
and Ruud, 1987) and paired comparisons (e.g. Bradley and Terry, 1952). The
polytomous logistic regression model, a standard model for discrete choices in
for instance biostatistics (e.g. Hosmer and Lemeshow, 2000), results as the
special case where $\mathbf{x}_{ij}^{s\prime}\mathbf{b}$ is omitted in (13.2).
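
For reference, the general fixed part and its two special cases can be set side by side. The display below is our summary of the statements above, written with the symbols of (13.2); it adds no new model.

```latex
\begin{align*}
\text{general fixed part (13.2):} \quad
  & f_{ij}^{s} = m^{s} + \mathbf{x}_{ij}^{s\prime}\mathbf{b}
    + \mathbf{x}_{ij}^{\prime}\mathbf{g}^{s} \\
\text{conditional logit:} \quad
  & f_{ij}^{s} = \mathbf{x}_{ij}^{s\prime}\mathbf{b} \\
\text{polytomous logistic regression:} \quad
  & f_{ij}^{s} = m^{s} + \mathbf{x}_{ij}^{\prime}\mathbf{g}^{s}
\end{align*}
```

In the conditional logit model only covariates that vary over alternatives can affect the choice, whereas in polytomous logistic regression covariates that are constant over alternatives act through alternative-specific coefficients.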


13.3.2 Level-1 random part 

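The component $\delta_{ij}^{s(1)}$, inducing dependence between alternatives within units,
is structured as

    $\delta_{ij}^{s(1)} = \mathbf{z}_{ij}^{s(1)\prime}\boldsymbol{\beta}_{ij}^{(1)} + \boldsymbol{\lambda}^{s(1)\prime}\boldsymbol{\eta}_{ij}^{(1)},$    (13.3)

where $\boldsymbol{\beta}_{ij}^{(1)}$ are random coefficients allowing the effects of alternative-specific
covariates $\mathbf{z}_{ij}^{s(1)}$ to vary between units $i$ and $\boldsymbol{\eta}_{ij}^{(1)}$ are factors representing
unobserved variables having effects $\boldsymbol{\lambda}^{s(1)}$ on $u_{ij}^s$. Alternatively, $\boldsymbol{\lambda}^{s(1)}$ can be
interpreted as unobserved attributes of alternative $s$ and $\boldsymbol{\eta}_{ij}^{(1)}$ as random effects
on the utilities. In the election example, letting $z_{ij}^{s(1)} = x_{ij}^{s}$ represent the
distance on the left-right political dimension, the random slope $\beta_{ij}^{(1)}$ allows the
effect of political distance to vary between voters.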

A model of the above type was discussed by McFadden and Train (2000)
for the case of a multinomial logit. A special case including only random
coefficients was considered by Hausman and Wise (1978). Models only including
common factors were suggested for paired comparisons by Bloxom (1972) and
Arbuckle and Nugent (1973) and for rankings by Brady (1989), Böckenholt
(1993) and Chan and Bentler (1998). It should be noted that unit-level
factor models are not identified for discrete choices unless unit and
alternative-specific covariates are included in the fixed part (e.g. Keane, 1992).


13.3.3 Level-2 random part 

The component $\delta_{ij}^{s(2)}$, inducing dependence among utilities within clusters, is
structured as

    $\delta_{ij}^{s(2)} = \mathbf{z}_{ij}^{s(2)\prime}\boldsymbol{\beta}_{j}^{(2)} + \mathbf{z}_{ij}^{(2)\prime}\boldsymbol{\gamma}_{j}^{s(2)} + \boldsymbol{\lambda}^{s(2)\prime}\boldsymbol{\eta}_{j}^{(2)},$    (13.4)

where $\boldsymbol{\beta}_{j}^{(2)}$ and $\boldsymbol{\gamma}_{j}^{s(2)}$ are vectors of random coefficients with corresponding
variable vectors $\mathbf{z}_{ij}^{s(2)}$ and $\mathbf{z}_{ij}^{(2)}$. $\boldsymbol{\beta}_{j}^{(2)}$ allows the effects of alternative-specific
covariates to vary between clusters whereas $\boldsymbol{\gamma}_{j}^{s(2)}$ represents the variability
in the effect of unit-specific covariates between clusters. Note that there
is in this case a random coefficient for each alternative. In the context of
the election example $\boldsymbol{\beta}_{j}^{(2)}$ could represent heterogeneity in the distance effect
between constituencies. $\boldsymbol{\gamma}_{j}^{s(2)}$ are random intercepts when $\mathbf{z}_{ij}^{(2)} = 1$ and could
more generally represent the random coefficients (varying over constituencies)
of a covariate that varies within constituencies (e.g. age). $\boldsymbol{\eta}_{j}^{(2)}$ are factors
representing unobserved variables at the cluster level having effects $\boldsymbol{\lambda}^{s(2)}$ on
$u_{ij}^s$. The random terms at levels 1 and 2 are analogous except that there is
no term corresponding to $\mathbf{z}_{ij}^{(2)\prime}\boldsymbol{\gamma}_{j}^{s(2)}$ at level 1, since identification would then be
fragile.

The special case including only random coefficients for unit-specific
covariates has been considered for discrete choices by Hedeker (2003) and Daniels
and Gatsonis (1997) and for paired comparisons by Böckenholt (2001a).
Revelt and Train (1998) specified the special case where there are only random
coefficients for alternative-specific or alternative and unit-specific covariates.
Random coefficients for alternative-specific covariates were used in a conjoint
choice experiment by Haaijer et al. (1998). The special case including only
common factors was considered by Elrod and Keane (1995). McFadden and
Train (2000), among others, specify ‘mixed logit’ models containing both
random coefficient and factor structures. Böckenholt (2001b) used this model for
rankings, also including discrete random intercepts $\gamma_{j}^{s(2)} = e_c^s$, $c = 1, \ldots, C$.

Latent class models, containing only a discrete random intercept, have been 
considered for discrete choice data by Kamakura et al. (1996), for rankings 
by Croon (1989) and for paired comparisons by Dillon et al. (1993), among 
others. We will give examples of latent class models for rankings and discrete 
choices in Sections 13.5 and 13.6, respectively. 

Allenby and Lenk (1994) and Skrondal and Rabe-Hesketh (2003b) are the 
only contributions we are aware of modeling dependence at both unit and 
cluster levels although the former do not include explicit terms representing 
unit-level heterogeneity. 

13.4 British general elections: Multilevel models for discrete 
choice and rankings 

Skrondal and Rabe-Hesketh (2003ab) modeled data from the 1987-1992 panel
of the British Election Study (Heath et al., 1991, 1993, 1994).¹ A total of 1608
respondents participated in the panel. Voting occasions with missing covariates
and where the voters did not vote for candidates from the major parties were
excluded. The resulting data comprised 2458 voting occasions, 1344 voters and
249 constituencies.

The alternatives Conservative, Labour, and Liberal (Alliance) are sometimes
labeled as con, lab, lib, corresponding to $s = 1, 2, 3$. The voters were
not explicitly asked to rank order the alternatives, but the first or discrete
choice clearly corresponds to rank 1. The voters also rated the parties on a
five point scale from ‘strongly against’ to ‘strongly in favour’. We used these
ratings to assign ranks to the remaining alternatives, ordering the parties in
terms of their rating. Tied second and third choices were observed for 394
voting occasions yielding top-rankings (first choices only).

¹ The data were made available by the UK Data Archive and the subset analyzed
here can be downloaded from gllamm.org/books, courtesy of Anthony Heath.

The covariates considered are 

• [LRdist] the distance between a voter’s position on the left-right political
dimension and the mean position of the party voted for. The mean
positions of the parties over voters were used to avoid rationalization problems
(e.g. Brody and Page, 1972). The placements were constructed from four
scales where respondents located themselves and each of the parties on
an 11 point scale anchored by two contrasting statements (priority should
be unemployment versus inflation, increase government services versus cut
taxation, nationalization versus privatization, more effort to redistribute
wealth versus less effort).

• [1987] a dummy variable for the 1987 national elections 

• [1992] a dummy variable for the 1992 national elections 

• [Male] a dummy for the voter being male 

• [Age] age of the voter in 10 year units 

• [Manual] a dummy for the voter’s father being a manual worker

• [Inflation] rating of perceived inflation since the last election on a five point 
scale 

The data have a hierarchical structure with elections $i$ (level 1) nested
within voters $j$ (level 2) and voters nested within constituencies $k$ (level 3).
The variable [LRdist] is alternative and election-specific and will be denoted
$x_{ijk}^s$. The variables [1987], [1992] and [Inflation] are election-specific, denoted
$\mathbf{x}_{ijk}$, whereas [Male], [Age] and [Manual] are voter-specific, denoted $\mathbf{x}_{jk}$.

The fixed part of the model includes all covariates and is of the same form
as (13.2) with a constant effect $b$ for [LRdist] and party-specific effects $\mathbf{g}^s$ for
all other covariates. The constant $m^s$ is not needed since the coefficients of
[1987] and [1992] represent election-specific constants.

We first estimated the conventional logistic model M0 without latent
variables for both discrete choices and rankings and the estimates are reported
in Table 13.1. We see that the estimated effects of the election and/or
voter-specific covariates are in accordance with previous research on British
elections. Being male and older increases the probability of voting Conservative,
whereas a perceived high inflation since the last election harms the incumbent
party (the Conservatives). The impact of social class is indicated by the higher
probability of voting Labour among voters with a father who is/was a manual
worker. Regarding our election, voter and alternative-specific covariate
[LRdist], the estimate also makes sense: the larger the political distance between
voter and party, the less likely it is that the voter will vote for the party.
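
Because the model is a logit model, each coefficient can be read as a log odds ratio. The short sketch below exponentiates three of the ranking-model estimates reported in Table 13.1; the choice of coefficients and the variable names are ours and serve only as an illustration.

```python
import math

# Ranking-model estimates copied from Table 13.1
estimates = {
    "LRdist (per unit of distance)": -0.62,
    "Male, Lab vs. Con": -0.79,
    "Manual, Lab vs. Con": 0.65,
}

# exp(coefficient) is the multiplicative change in the odds
odds_ratios = {name: math.exp(est) for name, est in estimates.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

Under these estimates, one extra unit of left-right distance to a party multiplies the odds of preferring that party to another by about 0.54, other things being equal, and being male roughly halves the odds of preferring Labour to the Conservatives.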

Table 13.1 Estimates for the conventional logistic model M0

                            Ranking                      Discrete choice
                   Lab vs. Con    Lib vs. Con     Lab vs. Con    Lib vs. Con
                   Est (SE)       Est (SE)        Est (SE)       Est (SE)

$g_1^s$ [1987]        0.38 (0.20)    0.12 (0.17)     0.51 (0.23)    0.13 (0.22)
$g_2^s$ [1992]        0.51 (0.20)    0.13 (0.18)     0.63 (0.24)   -0.13 (0.23)
$g_3^s$ [Male]       -0.79 (0.11)   -0.53 (0.09)    -0.77 (0.13)   -0.67 (0.12)
$g_4^s$ [Age]        -0.37 (0.04)   -0.18 (0.03)    -0.34 (0.04)   -0.20 (0.04)
$g_5^s$ [Manual]      0.65 (0.11)   -0.05 (0.10)     0.69 (0.13)   -0.10 (0.12)
$g_6^s$ [Inflation]   0.87 (0.09)    0.18 (0.03)     0.76 (0.10)    0.57 (0.09)
$b$ [LRdist]                -0.62 (0.02)                  -0.54 (0.02)

Log-likelihood                -2963.68                      -1957.91

Source: Skrondal and Rabe-Hesketh (2003b)


We consider three types of models for the random part:

(a) a random coefficient model for political distance [LRdist], inducing
dependence and allowing the effect of $x_{ijk}^s$ to vary over elections: $\beta_{ijk}^{(1)} z_{ijk}^s$,
over voters: $\beta_{jk}^{(2)} z_{ijk}^s$ and over constituencies: $\beta_{k}^{(3)} z_{ijk}^s$. Note that $z_{ijk}^s = x_{ijk}^s$
in this application.

(b) a one-factor model, inducing dependence within elections: $\lambda^{s(1)}\eta_{ijk}^{(1)}$, within
voters: $\lambda^{s(2)}\eta_{jk}^{(2)}$ and within constituencies: $\lambda^{s(3)}\eta_{k}^{(3)}$. At the election level,
this is a common factor model because the $\epsilon_{ijk}^s$ can be interpreted as unique
factors at that level. However, at the higher levels we have factor models
without unique factors.

(c) a correlated alternative-specific random intercept model, inducing
dependence within voters: $\gamma_{jk}^{s(2)}$ and within constituencies: $\gamma_{k}^{s(3)}$.

We do not consider correlated alternative-specific random intercept models
at the election level since they would be extremely fragile in terms of
identification.

At a given level, e.g. the voter level, the model in (b) with random terms
$\lambda^{s(2)}\eta_{jk}^{(2)}$, $s = 2, 3$ is nested in the random coefficient model (c) with random
terms $\gamma_{jk}^{s(2)}$, $s = 2, 3$ since the variances of the random terms are unconstrained
in both cases whereas the correlation is fixed at one in the factor model and
unconstrained in the random coefficient model.
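
The nesting can be made explicit by writing out the two covariance matrices at the voter level. The display below is a sketch under assumptions: $\psi$ denotes the factor variance (a symbol we introduce here), the $\psi_{\gamma}$ notation follows the labels of Table 13.3, and one loading or the factor variance is taken to be fixed for identification. It requires the `amsmath` package.

```latex
\[
\operatorname{Cov}
\begin{pmatrix}
  \lambda^{2(2)}\eta_{jk}^{(2)} \\
  \lambda^{3(2)}\eta_{jk}^{(2)}
\end{pmatrix}
= \psi
\begin{pmatrix}
  (\lambda^{2(2)})^{2} & \lambda^{2(2)}\lambda^{3(2)} \\
  \lambda^{2(2)}\lambda^{3(2)} & (\lambda^{3(2)})^{2}
\end{pmatrix},
\qquad
\operatorname{Cov}
\begin{pmatrix}
  \gamma_{jk}^{2(2)} \\
  \gamma_{jk}^{3(2)}
\end{pmatrix}
=
\begin{pmatrix}
  \psi_{\gamma^{2}} & \psi_{\gamma^{2},\gamma^{3}} \\
  \psi_{\gamma^{2},\gamma^{3}} & \psi_{\gamma^{3}}
\end{pmatrix}.
\]
```

The first matrix has rank one, so the two random terms are perfectly correlated (correlation $\pm 1$) and only two parameters are free; the second has three free parameters. This agrees with the one-parameter difference between the (b) and (c) models at a single level.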

When dependence is modeled at several levels, we use the same kind of 
model (e.g. (a), (b) or (c)) at the different levels in order to limit the set of 
models considered. Note that this is a practical consideration; combinations of 
models can be used at the same level and different models specified at different 
levels. Furthermore, no parameter restrictions are imposed across levels. The 
multilevel models are referred to by numbers indicating the levels followed by 
letters in parentheses for the model type, for instance 12(b) for a one-factor 
model specified at the election and voter levels. 

The sequence of fitted models, their number of parameters (# Par) and
log-likelihoods are reported in Table 13.2 for rankings (10-point adaptive
quadrature was used).


Table 13.2 Estimated models for rankings (the fixed part includes all covariates)

                     Random part
Model       Election   Voter   Constit.     # Par    Log-likelihood

M0                                            13       -2963.68

M1(a)       (a)                               14       -2945.83
M1(b)       (b)                               15       -2842.73

M2(a)                  (a)                    14       -2893.19
M2(b)                  (b)                    15       -2693.78
M2(c)                  (c)                    16       -2645.99

M3(a)                          (a)            14       -2948.44
M3(b)                          (b)            15       -2846.26
M3(c)                          (c)            16       -2844.41

M12(a)      (a)        (a)                    15       -2893.19
M12(b)      (b)        (b)                    17       -2691.97

M23(a)                 (a)     (a)
M23(b)                 (b)     (b)
M23(c)                 (c)     (c)

Here $x_{ijk}^s = z_{ijk}^s$ is the distance on the left-right political dimension.
Source: Skrondal and Rabe-Hesketh (2003b)


We first introduce latent variables only at the election level in M1(a) and
M1(b) and see from the table that the fit is considerably improved compared
to the conventional model M0, indicating that there is cross-sectional
dependence among the utilities at a given election (given the covariates in the fixed
part).

Latent variables are then introduced only at the voter level in models
M2(a), M2(b) and M2(c). The fit is much improved compared to the
conventional model, indicating that there is unobserved heterogeneity at the voter
level inducing longitudinal dependence within voters. We note in passing that
models akin to M2(a) have attracted considerable interest in political science
(e.g. Rivers, 1988).

We next incorporate latent variables only at the constituency level in M3(a),
M3(b) and M3(c), and see that the fit is once more improved. Importantly,
latent variables at a given level will not only induce dependence at that level
but also induce dependence at all lower levels. Hence, it is possible that latent
variables at the election level are superfluous once latent variables are included
at the voter level.

To resolve this issue we include latent variables at both election and voter
levels in M12(a) and M12(b). The improvement in fit achieved by including
the additional election level latent variables is negligible, suggesting that the
cross-sectional dependence within elections is largely due to subject level
heterogeneity. We therefore do not need to include latent variables at the election
level in the subsequent analyses as long as latent variables at the voter level
are included. It would be surprising if the latent variables at the voter level
were not needed when latent variables are specified at the constituency level
since this would imply conditional independence between a voter’s utilities at
the two elections given the constituency level effects. As expected, therefore,
the M23 models fit considerably better than the M3 models, confirming that
latent variables are needed at both voter and constituency levels. Regarding
the choice between the M23 models, we observe that the fit of the random
coefficient model M23(a) is inferior to the factor model M23(b) and the
random intercepts model M23(c). The choice between the latter two models,
which are nested, suggests that the correlated random intercepts model is the
preferred model.
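
The comparisons described above can be made concrete by computing twice the difference in log-likelihood for pairs of nested models. The sketch below uses the parameter counts and log-likelihoods as read from Table 13.2 (the M23 rows are left out because their values are not available here); the helper name is ours. No p-values are attached, since several of the comparisons involve variance parameters on the boundary of the parameter space.

```python
# (number of parameters, log-likelihood) as read from Table 13.2
fits = {
    "M0": (13, -2963.68),
    "M1(a)": (14, -2945.83),
    "M1(b)": (15, -2842.73),
    "M2(a)": (14, -2893.19),
    "M2(b)": (15, -2693.78),
    "M2(c)": (16, -2645.99),
    "M12(a)": (15, -2893.19),
    "M12(b)": (17, -2691.97),
}


def lr_statistic(smaller: str, larger: str):
    """Return (2 * log-likelihood difference, extra parameters) for nested models."""
    par_s, ll_s = fits[smaller]
    par_l, ll_l = fits[larger]
    return 2.0 * (ll_l - ll_s), par_l - par_s


comparisons = [
    ("M0", "M1(b)"),      # election-level factor added
    ("M0", "M2(b)"),      # voter-level factor added
    ("M2(b)", "M2(c)"),   # factor versus correlated random intercepts
    ("M2(a)", "M12(a)"),  # election level added to voter level
    ("M2(b)", "M12(b)"),
]
results = {pair: lr_statistic(*pair) for pair in comparisons}

for (small, large), (lr, extra) in results.items():
    print(f"{large} vs {small}: LR = {lr:.2f} on {extra} extra parameter(s)")
```

Adding the election level to the voter-level models changes the statistic by 0.00 and 3.62, respectively, which is the negligible improvement referred to in the text, whereas moving from M2(b) to M2(c) gives 95.58 for a single extra parameter.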

Estimates for our retained model M23(c) are reported in Table 13.3 for
rankings and discrete choices in the left and right panels, respectively. The
estimates of the fixed regression coefficients are generally greater in magnitude
than those shown in Table 13.1 for the conventional model, as expected. The
random effects variances at the voter level are larger than at the constituency
level, consistent with a greater residual variability between voters within
constituencies than between constituencies, as would be expected.

The variance of the random effect for Labour, representing residual
variability in the utility differences between Labour and Conservatives, is particularly
large, reflecting the presence of a mixture of people with strong residual
(unexplained) support for the Labour or Conservative parties. There is a positive
correlation between the random effects for the Labour and Liberal parties,
suggesting that those who prefer Labour to the Conservatives, after conditioning
on the covariates, also tend to prefer the Liberal party to the Conservatives.
This is consistent with the Liberal party being placed between the Labour
and Conservative parties and suggests that the [LRdist] covariate has not
fully captured this ordering.
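
The size of this correlation can be obtained from the variance and covariance estimates in Table 13.3. The following sketch uses the ranking estimates; the function name is ours and the calculation is simply covariance divided by the product of standard deviations.

```python
import math


def correlation(var_a: float, var_b: float, cov_ab: float) -> float:
    """Correlation implied by two variances and their covariance."""
    return cov_ab / math.sqrt(var_a * var_b)


# Ranking estimates from Table 13.3: (Var Labour, Var Liberal, covariance)
voter_corr = correlation(16.13, 6.03, 8.53)
constituency_corr = correlation(4.91, 0.60, 1.21)

print(f"voter level:        {voter_corr:.2f}")
print(f"constituency level: {constituency_corr:.2f}")
```

The implied correlations are about 0.86 at the voter level and about 0.70 at the constituency level.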

Table 13.3 Estimates for correlated alternative-specific random intercepts model at
voter and constituency levels M23(c)

                                  Ranking                      Discrete choice
                         Lab vs. Con    Lib vs. Con     Lab vs. Con    Lib vs. Con
                         Est (SE)       Est (SE)        Est (SE)       Est (SE)

Fixed part
$g_1^s$ [1987]              0.77 (0.56)    0.75 (0.37)     0.95 (0.52)    0.13 (0.52)
$g_2^s$ [1992]              1.28 (0.59)    0.78 (0.39)     1.32 (0.54)   -0.30 (0.55)
$g_3^s$ [Male]             -0.99 (0.31)   -0.71 (0.20)    -1.15 (0.28)   -0.96 (0.27)
$g_4^s$ [Age]              -0.74 (0.11)   -0.37 (0.07)    -0.61 (0.10)   -0.36 (0.09)
$g_5^s$ [Manual]            1.57 (0.34)    0.10 (0.22)     1.31 (0.31)    0.04 (0.29)
$g_6^s$ [Inflation]         1.31 (0.18)    0.74 (0.13)     1.17 (0.19)    0.97 (0.18)
$b$ [LRdist]                      -0.79 (0.04)                  -0.87 (0.05)

Random part
Voter-level
$\psi_{\gamma^s}^{(2)}$                16.13 (2.05)    6.03 (0.90)     7.43 (1.62)    9.11 (1.61)
$\psi_{\gamma^2,\gamma^3}^{(2)}$                 8.53 (1.15)                   5.90 (1.30)
Const.-level
$\psi_{\gamma^s}^{(3)}$                 4.91 (1.12)    0.60 (0.29)     3.12 (0.86)    1.74 (0.60)
$\psi_{\gamma^2,\gamma^3}^{(3)}$                 1.21 (0.48)                   1.11 (0.60)

Log-likelihood                      -2600.90                      -1748.95

Source: Skrondal and Rabe-Hesketh (2003b)


To further interpret the estimates for the random part of the model, we
present the model-implied residual (conditional on the covariates) correlation
matrices for the utility differences in Table 13.4. To derive the residual
correlations between utility differences for a voter at a given election, write the
model for the three-dimensional vector of utility residuals as

    $\mathbf{u}_{ijk} - \mathbf{f}_{ijk} = \mathbf{Z}\boldsymbol{\gamma}_{jk}^{(2)} + \mathbf{Z}\boldsymbol{\gamma}_{k}^{(3)} + \boldsymbol{\epsilon}_{ijk}, \qquad \mathbf{Z} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.$


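The covariance matrix of this vector of utility residuals becomes

    $\mathrm{Cov}(\mathbf{u}_{ijk} - \mathbf{f}_{ijk}) = \mathbf{Z}\boldsymbol{\Psi}^{(2)}\mathbf{Z}' + \mathbf{Z}\boldsymbol{\Psi}^{(3)}\mathbf{Z}' + \frac{\pi^2}{6}\,\mathbf{I}_3.$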


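Consider the vector of differences in utility residuals,

    $\begin{bmatrix} (u_{ijk}^{\mathrm{lab}} - f_{ijk}^{\mathrm{lab}}) - (u_{ijk}^{\mathrm{con}} - f_{ijk}^{\mathrm{con}}) \\ (u_{ijk}^{\mathrm{lib}} - f_{ijk}^{\mathrm{lib}}) - (u_{ijk}^{\mathrm{con}} - f_{ijk}^{\mathrm{con}}) \\ (u_{ijk}^{\mathrm{lab}} - f_{ijk}^{\mathrm{lab}}) - (u_{ijk}^{\mathrm{lib}} - f_{ijk}^{\mathrm{lib}}) \end{bmatrix}.$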
Table 13.4 Residual within-constituency correlation matrix implied by M23(c)
for the utility differences (subscript k omitted)

Overview

                        voter j            voter j'
                      1987    1992       1987    1992
voter j     1987       A
            1992       B       A
voter j'    1987       C       C          A
            1992       C       C          B       A

Rows and columns of A, B and C are the utility differences
Lab - Con, Lib - Con and Lab - Lib.

A: Within voter and election
               Lab - Con   Lib - Con   Lab - Lib
Lab - Con        1
Lib - Con        0.73        1
Lab - Lib        0.78        0.14        1

B: Within voter between elections
               Lab - Con   Lib - Con   Lab - Lib
Lab - Con        0.86
Lib - Con        0.63        0.67
Lab - Lib        0.68        0.29        0.71

C: Between voters (within or between elections)
               Lab - Con   Lib - Con   Lab - Lib
Lab - Con        0.21
Lib - Con        0.09        0.08
Lab - Lib        0.22        0.06        0.27

Source: Skrondal and Rabe-Hesketh (2003b)
Defining the comparison matrix

    $\mathbf{H} = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix},$

the covariance matrix of the differences in utility residuals becomes

    $\mathrm{Cov}(\mathbf{H}(\mathbf{u}_{ijk} - \mathbf{f}_{ijk})) = \mathbf{H}\,\mathrm{Cov}(\mathbf{u}_{ijk} - \mathbf{f}_{ijk})\,\mathbf{H}'.$

Here the signs of the utility differences are such that we expect positive
correlations if the Liberal party is positioned between the Conservative and Labour
parties conditional on the covariates. For example, those who prefer Labour
to Conservative (positive $u_{ijk}^{\mathrm{lab}} - u_{ijk}^{\mathrm{con}}$) are likely to also prefer Liberal to
Conservative (positive $u_{ijk}^{\mathrm{lib}} - u_{ijk}^{\mathrm{con}}$).

As expected, the implied cross-sectional correlations at a given election (A) 
are larger than those implied from the fixed effects model (0.5 in column 1 and 
—0.5 in column 2). The longitudinal correlations across elections within voters 
(B) are all positive (the fixed effects model implies zero correlations), the 
difference in utilities between the Labour and Conservative parties being the 
most highly correlated across elections. As would be expected, the correlations 
between different voters in the same constituency (C) are much lower than 
between elections for the same voter. The correlation involving the Liberal- 
Conservative differences tend to be lower than the others, suggesting that 
these parties were less distinguishable from each other after adjusting for the 
covariates than the other pairs of parties. 
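
The fixed effects benchmark quoted above (0.5 in column 1 and -0.5 in column 2) can be checked numerically. The following is a minimal sketch, assuming that without latent variables the residual covariance matrix is proportional to the identity; the function name is illustrative and not taken from the source.

```python
import numpy as np

# Comparison matrix H from the text; its rows form the three utility differences.
H = np.array([[-1, 1, 0],
              [-1, 0, 1],
              [0, 1, -1]], dtype=float)


def implied_correlations(cov_residuals):
    """Correlation matrix of H @ residuals, given Cov(residuals)."""
    cov_diff = H @ cov_residuals @ H.T
    sd = np.sqrt(np.diag(cov_diff))
    return cov_diff / np.outer(sd, sd)


# Assumption: with no latent variables the residuals are independent with a
# common variance, so the covariance is proportional to the identity and the
# scale cancels in the correlations.
corr_fixed = implied_correlations(np.eye(3))
print(np.round(corr_fixed, 2))
```

Passing a different residual covariance matrix to the same function gives the correlations implied by a model with latent variables.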


13.5 Post-materialism: A latent class model for rankings 

In ‘The Silent Revolution’, Inglehart (1977) contended that a transformation
was taking place in the political cultures of advanced industrial societies. 
Materialist goals of economic and national security were fading from people’s 
basic value priorities. In their wake was a growing wave of post-materialist 
values - values which emphasize such goals as protecting the freedom of speech 
and giving people more say in important political decisions. With this change 
in values, the theory continues, came changes in the salient political issues, 
changes in political cleavages and changes in political participation. 

Inglehart argued that post-materialism is best measured by asking respondents
to rank materialistic and post-materialistic values (instead of using, for
instance, rating scales). In the eight-nation survey conducted in 1974/1975
(Wieken-Mayser et al., 1979; see also Barnes et al., 1979), also analyzed in
Section 10.3, respondents were therefore asked to rank the following four
political values according to their desirability:

1. [Order] maintain order in the nation 

2. [Say] give people more say in decisions of the government 

3. [Prices] fight rising prices 
4. [Freedom] protect freedom of speech

Materialists would be expected to give preference to [Order] and [Prices]
whereas post-materialists would be expected to prefer [Say] and [Freedom].
Following Croon (1989), we consider data²,³ on the 2262 German respondents
given in Table 13.5. The alternatives are numbered as above so that the first
ranking in the table, 1234, represents the rank order [Order], [Say], [Prices],
[Freedom].

The heterogeneity in value orientations can be modeled by assuming that
respondents’ ‘utilities’ for the political values vary randomly from the
overall mean. Croon (1989) describes an exploratory latent class analysis of
these rankings. The linear predictor for value s and latent class c can be
parameterized as

$$\nu^s_{jc} = e^s_c, \quad s = 1, 2, 3, 4, \qquad (13.5)$$

where

$$e^4_c = 0.$$

Here we estimate the locations for all latent classes instead of setting their 
mean to zero. The parameter estimates are shown in Table 13.6 for one, two 
and three latent classes. 

From the one-class model it is clear that, on average, the materialistic values
were preferred since the estimated log-odds $e^1$ and $e^3$ for [Order] and
[Prices] are larger than $e^2$ and $e^4$ (= 0) for [Say] and [Freedom]. The
two-class solution suggests that about 21% of the population is
post-materialistic (class 2) with negative log-odds for values 1 and 3, whereas
about 79% of the population is materialistic (class 1). In the three-class
model, the materialists are split into 32% who value [Order] most (class 3) and
45% who value [Prices] most (class 1). The prevalence of post-materialism is
now estimated as 23%.

Table 13.5 shows the posterior probabilities for the three-class solution with
the highest probability in bold. Class 1 has high posterior probability when
[Prices] (3) is ranked high, class 2 if [Say] (2) and [Freedom] (4) are ranked
high and class 3 if [Order] (1) is ranked high. Croon’s three-class estimates
are given in his Table 3 where he uses a different parameterization, imposing
the constraint $\sum_s e^s_c = 0$ instead of fixing $e^4_c = 0$. For the largest
class with probability 0.45, this gives the locations 0.60, -1.07, 1.71 and
-1.24 which (almost) agree with Croon’s result of 0.59, -1.07, 1.73, -1.25.
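
The change of parameterization described above amounts to subtracting the mean location within a class. A minimal sketch, using the rounded class 1 estimates from Table 13.6 (so the last digit can differ slightly from the values quoted in the text, which were presumably computed from unrounded estimates); the function name is illustrative only:

```python
# Three-class estimates for class 1 from Table 13.6, with the [Freedom]
# location fixed at 0. Order: [Order], [Say], [Prices], [Freedom].
locations_ref = [1.84, 0.17, 2.96, 0.0]


def to_sum_to_zero(locations):
    """Re-express locations so that they sum to zero within the class."""
    mean = sum(locations) / len(locations)
    return [loc - mean for loc in locations]


centered = to_sum_to_zero(locations_ref)
print([round(x, 2) for x in centered])
```

Differences between locations, and hence the ranking probabilities, are unchanged by this shift.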

Croon (1989) assesses the adequacy of different numbers of latent classes
using the deviance defined as twice the difference in log-likelihoods between
a given model and the full or saturated model. The log-likelihood for the
saturated model can be obtained from the original data as

$$\sum_R n_R \ln p_R = -6269.52,$$

2 The data used in this section were compiled by S.H. Barnes, R. Inglehart, M.K. Jennings
and B. Farah and made available by the Norwegian Social Science Data Services (NSD).
Neither the original collectors nor NSD are responsible for the analysis reported here.

3 The data in Table 13.5 are available at gllamm.org/books
Source: Croon (1989)


where $n_R$ and $p_R$ are the absolute and relative observed frequencies of
ranking R and the sum is over all 24 observed rankings. The deviances are given
in Table 13.6.
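
The deviances and degrees of freedom in Table 13.6 can be reproduced approximately from the reported log-likelihoods. This is a sketch under stated assumptions: the parameter count (three free locations per class plus the free class probabilities) is inferred from the text rather than quoted from it, and because the inputs are rounded the deviances agree with the table only to within about 0.1.

```python
LOGLIK_SATURATED = -6269.52
N_RANKINGS = 24   # 4! possible rankings of the four values
N_VALUES = 4


def deviance(loglik):
    """Twice the log-likelihood difference from the saturated model."""
    return 2.0 * (LOGLIK_SATURATED - loglik)


def degrees_of_freedom(n_classes):
    saturated = N_RANKINGS - 1
    # Assumed count: 3 free locations per class (one location is fixed at 0)
    # plus (n_classes - 1) free class probabilities.
    model = n_classes * (N_VALUES - 1) + (n_classes - 1)
    return saturated - model


logliks = {1: -6427.05, 2: -6311.69, 3: -6281.36}
for n_classes, loglik in logliks.items():
    print(n_classes, round(deviance(loglik), 2), degrees_of_freedom(n_classes))
```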

We will now extend Croon’s analysis by allowing the class probabilities to 
depend on the following covariates: 

• [Female] a dummy variable for females 

• [Age] age categories: 15-30 (reference), 31-45, 46-60 and above 60 

• [Education] education categories: compulsory school (reference), middle 
level or academic level 


Table 13.5 Materialism data and posterior probabilities

          Data              Results for 3 class model
                                     Posterior prob. (%)
                   Pred.    Class:    1       2       3
Ranking   Freq.    freq.    Prior:    0.45    0.23    0.32


071133026030224040214090 
86667842 11 35 

013266076368328597614737 
13 2474999 1 T 6629-928 

016721028712569572372383 
1 3 3 124 .8 8 7 2 8 3 2 6 1 

665703097121919611805733 
241549525636380584132433 
1 3 2 3 2 1 


799523831539047904109257 
320559426535391673232532 
1 3 2 3 2 1 

43423243413142412132312 

342423341413241412231312 

223344113344112244112233 

11111222222333333444444 
Table 13.6 Parameter estimates for Croon’s latent class model

                          One class       Two classes     Three classes

Class 1
  Probability π_1         1               0.79            0.45
  Locations
    e^1_1 [Order]         1.16 (0.04)     1.94 (0.09)     1.84 (0.15)
    e^2_1 [Say]           0.21 (0.04)     0.21 (0.05)     0.17 (0.09)
    e^3_1 [Prices]        1.28 (0.04)     1.87 (0.09)     2.96 (0.31)
    e^4_1 [Freedom]       0               0               0

Class 2
  Probability π_2         -               0.21            0.23
  Locations
    e^1_2 [Order]         -               -0.87 (0.09)    -0.76 (0.26)
    e^2_2 [Say]           -               0.44 (0.12)     0.56 (0.12)
    e^3_2 [Prices]        -               -0.21 (0.16)    -0.09 (0.19)
    e^4_2 [Freedom]       -               0               0

Class 3
  Probability π_3         -               -               0.32
  Locations
    e^1_3 [Order]         -               -               3.14 (0.40)
    e^2_3 [Say]           -               -               0.21 (0.10)
    e^3_3 [Prices]        -               -               1.18 (0.16)
    e^4_3 [Freedom]       -               -               0

Log-likelihood            -6427.05        -6311.69        -6281.36
Deviance                  315.05          84.32           23.58
Degrees of freedom        20              16              12



We specify a structural model (see Section 4.3.2) as

$$\pi_{jc} = \frac{\exp(\mathbf{v}_j'\boldsymbol{\varrho}^c)}{\sum_d \exp(\mathbf{v}_j'\boldsymbol{\varrho}^d)}.$$

Full covariate information was available on 2246 of the 2262 subjects.
Re-estimating the three-class model without covariates on this subsample gave a
log-likelihood of -6239.58. The estimates are shown under M_0 in Table 13.7.
Allowing the class probabilities to depend on the covariates increased the
log-likelihood to -6043.82 (for a loss of 12 degrees of freedom) with estimates
shown under M_1 in Table 13.7. Constraining the effect of age and education
to be linear across categories (scored 1, 2, ...) decreased the log-likelihood
by only 2.79 (6 degrees of freedom) so that this model M_2, with estimates
shown in Table 13.7, is preferred.
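
One way to read these log-likelihood comparisons is as likelihood ratio tests. The text does not report p-values, so the following is only an illustrative sketch; it uses the closed form of the chi-square tail probability for even degrees of freedom to stay within the standard library, and the function name is not from the source.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variate (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# M_2 (linear in category scores) against M_1 (unconstrained): 6 df.
lr_m2_vs_m1 = 2 * 2.79
p_m2_vs_m1 = chi2_sf_even_df(lr_m2_vs_m1, 6)

# M_0 (no covariates) against M_1: 12 df.
lr_m0_vs_m1 = 2 * (-6043.82 - (-6239.58))
p_m0_vs_m1 = chi2_sf_even_df(lr_m0_vs_m1, 12)

print(round(lr_m2_vs_m1, 2), round(p_m2_vs_m1, 3))
print(round(lr_m0_vs_m1, 2), p_m0_vs_m1)
```

On this reading the linearity constraints cost very little fit, whereas dropping the covariates altogether costs a great deal, in line with the preference for M_2 stated above.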
For models M_1 and M_2 with covariates, the interpretation of the latent
classes remains as for model M_0 without covariates, with only small changes
in the estimated parameters of the response model. Interestingly, being female
reduces the probability of being post-materialistic (class 2) as does [Age],
whereas [Education] increases the probability. [Education] and [Age] both
increase the probability of valuing [Order] (class 3) over [Prices] whereas
[Female] has little effect.

For models M_0 and M_2, the proportions of classification errors (or
misclassification rates) $f$, estimated as described on page 237 in Section 7.4,
are given in Table 13.8. If the ranking response has been observed, respondents
can be classified by assigning them to the class with the highest posterior
probability $\omega(e_c \mid \mathbf{y}_j, \mathbf{X}_j; \hat{\boldsymbol{\theta}})$.
The corresponding estimated misclassification rates are given in rows 2 and 4
of the table for models M_0 and M_2, respectively. Model M_2 has a slightly
lower misclassification rate of 0.202 compared with 0.217 for model M_0
because it uses covariate information.
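
Classification by highest posterior probability can be sketched as follows. The sketch makes two assumptions that should be kept in mind: ranking probabilities are taken to follow the exploded logit (successive first choices among the remaining values), and the rounded three-class estimates from Table 13.6 are used, so the results need not match Table 13.5 to the last digit. Function names are illustrative only.

```python
import math

# Three-class estimates from Table 13.6 (rounded).
# Order of values: [Order], [Say], [Prices], [Freedom].
PRIORS = [0.45, 0.23, 0.32]
LOCATIONS = [
    [1.84, 0.17, 2.96, 0.0],    # class 1
    [-0.76, 0.56, -0.09, 0.0],  # class 2
    [3.14, 0.21, 1.18, 0.0],    # class 3
]


def ranking_probability(ranking, locations):
    """Assumed exploded logit: a product of successive first-choice logits."""
    remaining = list(range(1, len(locations) + 1))
    prob = 1.0
    for chosen in ranking:
        denom = sum(math.exp(locations[s - 1]) for s in remaining)
        prob *= math.exp(locations[chosen - 1]) / denom
        remaining.remove(chosen)
    return prob


def posterior(ranking):
    """Posterior class probabilities by Bayes' rule."""
    joint = [p * ranking_probability(ranking, loc)
             for p, loc in zip(PRIORS, LOCATIONS)]
    total = sum(joint)
    return [j / total for j in joint]


post = posterior([2, 4, 1, 3])  # [Say], [Freedom], [Order], [Prices]
assigned_class = post.index(max(post)) + 1
print([round(p, 3) for p in post], assigned_class)
```

A respondent who ranks [Say] and [Freedom] first is assigned to class 2, the post-materialist class, which matches the description of the posterior probabilities given earlier.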

If the response has not been observed, we must base classification on the
prior probabilities $\pi_{jc}$. In model M_0 this amounts to assigning everyone
to class 1 (with modal probability $\hat{\pi}_{j1} = 0.460$), giving a
misclassification rate of $f = 0.540$. In model M_2 the prior probability uses
covariate information so that the proportion of classification errors is
reduced to 0.466; see rows 1 and 3 of Table 13.8. Such classification based on
covariates or ‘concomitant’ variables only is, for instance, sometimes required
for targeted marketing where rankings, choices or ratings of products are
available only for a small ‘training set’ whereas covariate information is also
available for future potential customers (e.g. Wedel, 2002b).

Classification accuracy can also be expressed in terms of the proportional
reduction in error (PRE), here relative to model M_0 when the response has not
been observed (row 1). These PREs are given in the last column of Table 13.8.
The PRE for model M_2 when the response is observed is 0.63.
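
The PRE values follow directly from the misclassification rates. A minimal sketch, assuming the usual definition of PRE as the relative drop in error rate from the baseline (the text does not spell the formula out, but this definition reproduces the last column of Table 13.8):

```python
BASELINE = 0.540  # model M_0, response not observed (row 1 of Table 13.8)


def pre(rate, baseline=BASELINE):
    """Proportional reduction in error relative to the baseline rate."""
    return (baseline - rate) / baseline


rates = {
    "response only (M_0)": 0.217,
    "covariates only (M_2)": 0.466,
    "response and covariates (M_2)": 0.202,
}
pre_values = {label: round(pre(rate), 2) for label, rate in rates.items()}
print(pre_values)
```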


13.6 Consumer preferences for coffee makers: A conjoint choice 
model 


Conjoint analysis is a marketing research technique that can provide valuable 
information for market segmentation, new product development, forecasting 
and pricing decisions (e.g. Wedel and Kamakura, 2000). 

In a real purchase situation, shoppers examine and evaluate a range of 
features or attributes in making their final purchase choice. Conjoint
analysis examines these trade-offs to determine what features are most valued
by purchasers. Once data are collected the researcher can conduct a number 
of ‘choice simulations’ to estimate market share for products with different 
attributes/features. This gives the researcher some idea which products or 
services are likely to be successful before they are introduced to the market. 
Table 13.8 Misclassification rate and proportional reduction in error (PRE)

Information                 Model     f         PRE

No information              M_0       0.540
Response only               M_0       0.217     0.60
Covariates only             M_2       0.466     0.14
Response and covariates     M_2       0.202     0.63


As an example we will consider data⁴,⁵ on a conjoint choice experiment
for coffee makers. After in-depth discussions with experts and consumers, 
hypothetical coffee-makers were defined using the following five attributes: 

• [Brand] brand-name: Philips, Braun, Moulinex 

• [Capacity] number of cups: 6, 10, 15 

• [Price] price in Dutch Guilders f: 39, 69, 99 

• [Thermos] presence of a thermos flask: yes, no 

• [Filter] presence of a special filter: yes, no 

A total of sixteen profiles were constructed from combinations of the levels 
of these attributes using an incomplete design (excluding unrealistic
combinations such as a coffee maker with all the features costing only 39f). In the
choice experiment, respondents were then asked to make choices out of sets 
of three profiles, each set containing the same base alternative. 

There are several advantages of conjoint choice experiments as compared to 
conventional conjoint analysis based on rating scales. One advantage is that 
choices may be more realistic since they resemble the purchasing situation, 
another that the problem of individual differences in interpreting rating scales 
is avoided. 

To construct the choice sets, the profiles were divided in two different ways 
into eight sets of two alternatives. A base alternative was added to each set, 
resulting in two groups of eight choice sets with three alternatives shown in 
Table 13.9. 185 respondents were recruited at a large shopping mall in the 
Netherlands. These respondents were randomly divided into two groups of 94 
and 91 subjects and each group was administered one of the groups of eight 
choice sets. 

The multinomial logit model containing only fixed effects of the vector of
attributes $\mathbf{x}^s$ for alternative (or profile) s has linear predictor

$$\nu^s_j = \mathbf{x}^{s\prime}\mathbf{b}$$

for subject j. Note that the intercepts $m^s$ are omitted to identify the model

4 We thank Michel Wedel for providing us with this dataset which accompanies the
GLIMMIX program (Wedel, 2002b).

5 The data can be downloaded from gllamm.org/books
Table 13.9 Choice sets for conjoint choice experiment

        Alternative 1                        Alternative 2
Set     Brand       Cap.  Pr.  Th.  Fi.      Brand       Cap.  Pr.  Th.  Fi.

Choice sets for group 1
1       Philips     10    69   -    Fi       Braun       15    69   Th   Fi
2       Braun       6     69   -    -        Moulinex    10    69   Th   Fi
3       Braun       10    39   -    -        Braun       10    99   Th   Fi
4       Philips     6     39   Th   Fi       Braun       10    39   Th   -
5       Philips     10    69   Th   -        Moulinex    15    39   -    Fi
6       Braun       6     69   -    Fi       Moulinex    10    69   -    -
7       Philips     15    99   -    -        Moulinex    6     99   Th   -
8       Braun       15    69   Th   -        Braun       10    99   -    Fi

Choice sets for group 2
1       Philips     10    69   Th   -        Moulinex    10    69   Th   Fi
2       Philips     15    99   -    -        Braun       15    69   Th   Fi
3       Braun       10    39   -    -        Moulinex    15    39   -    Fi
4       Braun       15    69   Th   -        Braun       10    99   -    Fi
5       Philips     10    69   -    Fi       Moulinex    6     99   Th   -
6       Braun       6     69   -    -        Braun       10    99   Th   Fi
7       Braun       6     69   -    -        Moulinex    10    69   -    -
8       Philips     6     39   Th   Fi       Braun       10    39   Th   -

        Alternative 3 (base)
Set     Brand       Cap.  Pr.  Th.  Fi.
1       Philips     6     69   -    Th


because the covariates $\mathbf{x}^s$ do not vary between subjects. Excluding a
constant also has the advantage that we can make predictions involving profiles
not included in the data.

This model is unrealistic since it assumes that all subjects have the same
mean utilities for the coffee makers. A more realistic model allows subjects to
differ in their coefficients for the attributes, reflecting different
preferences or ‘tastes’. The linear predictor becomes

$$\nu^s_j = \mathbf{x}^{s\prime}\mathbf{b} + \mathbf{x}^{s\prime}\boldsymbol{\gamma}_j.$$

If the market is believed to consist of different ‘segments’ that are
homogeneous in their preferences, a latent class model can be specified using a
discrete random coefficient vector

$$\boldsymbol{\gamma}_j = \mathbf{e}_c, \quad c = 1, \ldots, C,$$

with probabilities $\pi_c$, where c labels the market segments. This model is
often referred to as a mixture regression model. We will exclude the fixed part
$\mathbf{x}^{s\prime}\mathbf{b}$ from the model instead of imposing the
constraint that the class locations have mean zero,
$\sum_c \pi_c \mathbf{e}_c = \mathbf{0}$, so that the linear predictor for class
c becomes

$$\nu^s_{jc} = \mathbf{x}^{s\prime}\mathbf{e}_c.$$

Alternatively, the random tastes $\boldsymbol{\gamma}_j$ can be assumed to be
continuous. Haaijer et al. (1998) develop a random coefficient multinomial
probit model where $\boldsymbol{\gamma}_j$ is multivariate normal with
covariance structure

$$\mathrm{Cov}(\boldsymbol{\gamma}_j) = \mathbf{Q}\mathbf{Q}'.$$

Here, Q is an 8 × T matrix and some constraints on Q are necessary for
identification. Even with these constraints, they found identification to be
fragile for T > 1. With T = 1, the multinomial logit version of the model can be
written in GRC notation as

$$\nu^s_j = \mathbf{x}^{s\prime}\mathbf{b} + \eta_j\,\mathbf{x}^{s\prime}\boldsymbol{\lambda}, \quad \mathrm{Var}(\eta_j) = 1.$$

Here we have set $\boldsymbol{\gamma}_j = \eta_j\boldsymbol{\lambda}$, thereby
reducing the number of dimensions from eight to one and greatly simplifying
estimation.
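
The dimension reduction can be made concrete: with a single latent variable of unit variance, the implied covariance matrix of the eight random coefficients is the outer product of the loadings and has rank one. The sketch below uses the loading estimates reported for M3 in Table 13.10 purely as illustrative inputs; the ordering of the loadings is an assumption made for the example.

```python
import numpy as np

# Free loadings for M3 from Table 13.10 (reference categories omitted).
# Assumed order: Philips, Braun, 6 cups, 10 cups, price 39, price 69,
# thermos, filter.
lam = np.array([0.51, -0.51, 1.72, 0.19, -1.72, -1.11, 0.09, -0.60])

# Cov(gamma_j) = lambda lambda' when gamma_j = eta_j * lambda, Var(eta_j) = 1.
cov_gamma = np.outer(lam, lam)
sd_gamma = np.sqrt(np.diag(cov_gamma))   # equals |lambda| element by element
rank = np.linalg.matrix_rank(cov_gamma)

print(rank)
print(np.round(sd_gamma, 2))
```

The largest standard deviations belong to the attributes whose coefficients vary most between subjects under this model.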

The estimates for the one-class, two-class and random coefficient models
are given in Table 13.10. The coefficients in the one-class model suggest that 
Braun is the least popular brand, a capacity of 10 cups is most desirable, 
followed by 15 cups and then 6 cups, cheaper coffee makers (39 or 69f) are 
preferred to the most expensive ones (99f) and coffee makers with filters and 
thermos flasks are preferred to coffee makers without these features. 

The two-class model fits considerably better than the one-class model
(difference in log-likelihoods is 188). The size of the first market segment is
estimated as 72% and that of the second segment as 28%. The first market
segment cares little about brands whereas the second strongly prefers Philips
and dislikes Braun. The first segment has a more marked preference for 10
cups and cares more about prices, as well as having a stronger preference for
thermos flasks than the second segment.
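
The segment sizes follow from the estimated log-odds of belonging to the first class. A brief sketch, taking the estimate of ln[π_1/(1 - π_1)] and the log-likelihoods from Table 13.10 as given; because the log-odds estimate is rounded, the result is close to, rather than exactly, the 72% quoted above.

```python
import math

logit_pi1 = 0.92  # estimate of ln[pi_1 / (1 - pi_1)] from Table 13.10
pi1 = 1.0 / (1.0 + math.exp(-logit_pi1))
pi2 = 1.0 - pi1

# Improvement in log-likelihood from one class to two classes.
loglik_gain = -1110.63 - (-1298.71)

print(round(100 * pi1, 1), round(100 * pi2, 1), round(loglik_gain, 2))
```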

The random coefficient model suggests that subjects vary mostly in the
degree to which they prefer Philips over Braun, the extent of dislike of a
6-cup capacity and their price sensitivity. It is interesting to note that the
log-likelihood for the independent multinomial logit model is very close to
that found by Haaijer et al. (1998) for the independent multinomial probit
model using simulated maximum likelihood (-1298.7 and -1299.9, respectively).
The log-likelihoods for the logit and probit versions of the random coefficient
model are also very similar (-1086.1 and -1086.6, respectively).

Haaijer et al. (1998) also consider choice simulations for different product 
Table 13.10 Estimates for conjoint choice analysis

                             Latent class models               Random coeff. model
                             M1              M2                M3
                             One class       Two classes
Variable          Par.       Est (SE)        Est (SE)          Par.    Est (SE)

[Brand]
  Philips         e^1_1      0.04 (0.06)     -0.11 (0.10)      b_1     -0.31 (0.15)
  Braun           e^1_2      -0.33 (0.06)    -0.14 (0.10)      b_2     -0.65 (0.14)
  Moulinex        e^1_3      0               0                 b_3     0
[Capacity]
  6               e^1_4      -1.02 (0.07)    -1.68 (0.13)      b_4     -2.08 (0.20)
  10              e^1_5      0.49 (0.05)     0.87 (0.10)       b_5     0.13 (0.13)
  15              e^1_6      0               0                 b_6     0
[Price]
  39              e^1_7      0.31 (0.09)     0.82 (0.21)       b_7     1.52 (0.27)
  69              e^1_8      0.37 (0.06)     0.33 (0.12)       b_8     1.20 (0.16)
  99              e^1_9      0               0                 b_9     0
[Thermos]
  yes             e^1_10     0.31 (0.04)     0.57 (0.09)       b_10    1.09 (0.15)
  no              e^1_11     0               0                 b_11    0
[Filter]
  yes             e^1_12     0.37 (0.06)     0.46 (0.06)       b_12    0.97 (0.11)
  no              e^1_13     0               0                 b_13    0

ln[π_1/(1 - π_1)] ϱ_0                        0.92 (0.21)

[Brand]
  Philips         e^2_1      -               0.56 (0.15)       λ_1     0.51 (0.18)
  Braun           e^2_2      -               -1.00 (0.20)      λ_2     -0.51 (0.20)
  Moulinex        e^2_3      -               0                 λ_3     0
[Capacity]
  6               e^2_4      -               -0.19 (0.12)      λ_4     1.72 (0.22)
  10              e^2_5      -               0.13 (0.12)       λ_5     0.19 (0.21)
  15              e^2_6      -               0                 λ_6     0
[Price]
  39              e^2_7      -               -0.31 (0.18)      λ_7     -1.72 (0.30)
  69              e^2_8      -               0.14 (0.12)       λ_8     -1.11 (0.21)
  99              e^2_9      -               0                 λ_9     0
[Thermos]
  yes             e^2_10     -               0.17 (0.10)       λ_10    0.09 (0.14)
  no              e^2_11     -               0                 λ_11    0
[Filter]
  yes             e^2_12     -               0.50 (0.10)       λ_12    -0.60 (0.19)
  no              e^2_13     -               0                 λ_13    0

Log-likelihood               -1298.71        -1110.63                  -1086.07

introductions. They use the four profiles listed in Table 13.11 to generate three
‘managerially relevant’ situations:


Table 13.11 Profiles for market simulations

Prof.  Brand      Cap.  Pr.  Th.  Fi.      Prof.  Brand     Cap.  Pr.  Th.  Fi.

P1     Philips    10    39   -    -        B2     Braun     15    69   Th   -
M3     Moulinex   15    69   -    -        P4     Philips   10    69   -    Fi


• Product modification: The current market consists of two products, Philips
(P1) and Braun (B2). Philips modifies its existing product (from P1 to P4)
by introducing a special filter and increasing the price.

• Product line extension: The current market consists of two products, Philips
(P1) and Braun (B2). Philips introduces a new product (P4) in addition
to its existing product.

• Introduction of a ‘me-too’ brand: The current market consists of two
products, Philips (P1) and Braun (B2). A third brand, Moulinex, introduces a
product (M3), close to the existing product of the current market leader
Braun, but without the thermos flask.

The market-share predictions for these three scenarios are given in
Table 13.12 for each of the models. Product modification leads to a greater
increase in market share for the independent and two-class models than for
the random coefficient model. Product-line extension by Philips leads to a
greater decrease in Braun’s market share for the independent and two-class
models than for the random coefficient model. Similarly, the introduction of a
new brand (‘me-too’) leads to a greater decrease in Braun’s market share for
the independent and two-class models than for the random coefficient model.
These simulations illustrate how assuming different dependence structures can
have important implications for predictions.
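
One concrete consequence of the dependence structure can be read off the ‘me-too’ predictions. In a multinomial logit model without latent variables, the ratio of two alternatives' shares does not change when a third alternative enters (the independence from irrelevant alternatives property, a standard fact about that model rather than something stated in the text). The sketch below simply recomputes the Philips-to-Braun ratio before and after the introduction from the shares in Table 13.12; the dictionary keys are labels chosen for this example.

```python
# Predicted shares (%) for the 'me-too' scenario, copied from Table 13.12.
before = {
    "one-class": {"P1": 41.5, "B2": 58.5},
    "two-class": {"P1": 45.8, "B2": 54.2},
    "random coefficient": {"P1": 43.9, "B2": 56.1},
}
after = {
    "one-class": {"P1": 26.2, "B2": 37.0, "M3": 36.8},
    "two-class": {"P1": 30.3, "B2": 39.4, "M3": 30.3},
    "random coefficient": {"P1": 30.4, "B2": 43.8, "M3": 25.7},
}


def philips_to_braun(shares):
    """Ratio of the Philips (P1) share to the Braun (B2) share."""
    return shares["P1"] / shares["B2"]


ratio_change = {
    model: philips_to_braun(after[model]) - philips_to_braun(before[model])
    for model in before
}
for model, change in ratio_change.items():
    print(model, round(change, 3))
```

The ratio is essentially unchanged for the one-class model (up to rounding of the shares) but falls for the two-class and random coefficient models, where the new product draws disproportionately from Philips and Braun's share is better protected.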


Table 13.12 Predicted market shares in percent

          Product modification       Product-line extension      Intro. ‘me-too’ brand
          Pr.   M1    M2    M3       Pr.   M1    M2    M3        Pr.   M1    M2    M3

Before:   P1    41.5  45.8  43.9     P1    41.5  45.8  43.9      P1    41.5  45.8  43.9
          B2    58.5  54.2  56.1     B2    58.5  54.2  56.1      B2    58.5  54.2  56.1

After:    P4    59.8  59.4  55.2     P1    22.2  21.6  18.6      P1    26.2  30.3  30.4
          B2    40.2  40.6  44.8     B2    31.3  31.4  37.1      B2    37.0  39.4  43.8
                                     P4    46.5  47.1  44.3      M3    36.8  30.3  25.7


© 2004 by Chapman & Hall/CRC 






13.7 Summary and further reading 

We have considered models for comparative responses in this chapter, basing
all applications on logistic regression models with continuous or discrete
latent variables. The first application considered panel or longitudinal data
on discrete choices and rankings from British general elections. The model
included both alternative and unit-specific covariates and latent variables at
different hierarchical levels. The second application concerned rankings on
materialistic and post-materialistic values among Germans. Latent class models
previously suggested for this application were extended by including
covariates. The third application was based on data from a conjoint choice
experiment where respondents were asked to choose among coffee makers. The
profiles of the coffee makers making up the choice sets were constructed
according to an experimental design, helping the market researcher to gauge
which product will be successful before introducing it to the market. See Wedel
and Kamakura (2000) for a modern treatment of market segmentation.

In all applications discussed in this chapter, the responses correspond to
decisions, but this need not be the case. Other examples of polytomous
responses are eye-color or diagnosis. Yang (2001) and Skrondal and
Rabe-Hesketh (2003c) use a three-level multinomial logit model to analyze the
quality of physicians’ treatment decisions. Pairwise comparison data also arise
in tournaments such as chess or football, whereas rankings could be the
finishing order in a horse-race.

In addition to the numerous references given in the introduction of this 
chapter, see Hartzel et al. (2001) and Train (2003) for useful overviews. For 
a review of latent variable models for polytomous responses and rankings, 
see Skrondal and Rabe-Hesketh (2003b). A more introductory treatment is 
given in Skrondal and Rabe-Hesketh (2003a). Takane (1987) and Bockenholt 
(2001a) discuss latent variable models for pairwise comparisons. 
CHAPTER 14


Multiple processes and mixed
responses


14.1 Introduction

In the previous application chapters we considered responses of particular
types. In this chapter we exploit the generality of the general model
framework and discuss applications where the responses are from multiple
processes and possibly of mixed type. Combinations treated here are continuous
and dichotomous, dichotomous and counts, continuous and continuous time
survival, and dichotomous and continuous time survival. Importantly, we will
see that it is often not permissible to simply decompose such problems, that
is, by separately modeling the different processes. As in other application
chapters, we will use continuous latent variables with parametric distributions
and discrete latent variables, interpretable as latent classes or as a
‘nonparametric’ estimator of an unspecified distribution. The usefulness of
structural models, regressing latent variables on other variables, will be
demonstrated.

14.2 Diet and heart disease: A covariate measurement error model 

14.2.1 Introduction

We consider estimating the effect of dietary fiber intake on coronary heart
disease (CHD), following the analysis in Rabe-Hesketh et al. (2003ab).

The dataset is on 337 middle aged men, recruited between 1956 and 1966
and followed until 1976 (Morris et al., 1977)¹. There were two occupational
groups, bank staff and London Transport staff (drivers and conductors). At
the time of recruitment, the men were asked to weigh their food over a
seven-day period from which the total number of calories was derived as well as
the amount of fat and fiber. Seventy-six bank staff had their diet measured in
the same way again six months later. Coronary heart disease was determined
from personnel records and, after retirement, by direct communication with
the retired men and by ‘tagging’ them at the Registrar General’s Office.

The explanatory variables used in the analysis are 

• [Age] age in years 

• [Transp] dummy variable for London Transport staff versus bank staff 
1 We thank David Clayton for providing us with these data. 
14.2.2 Logistic regression with covariate measurement error

We will estimate the association between fiber intake (exposure) and heart
disease using logistic regression, taking into account that exposure
measurement is imperfect and that replicate measurements are available for a
subsample. This may be accomplished by introducing a latent variable for
unobserved true exposure and specifying three submodels (following Clayton’s
(1992) terminology): an exposure model, a measurement model and a disease
model.

Exposure model 

True fiber intake $\eta_j$ for subject j is modeled using the structural model

$$\eta_j = \mathbf{x}_j'\boldsymbol{\gamma} + \zeta_j, \qquad (14.1)$$

where the covariates $\mathbf{x}_j$ are [Age], [Transp] and their interaction.
Traditionally, a normal exposure distribution is assumed with
$\zeta_j \sim N(0, \psi)$, but we will also consider a nonparametric exposure
distribution of the kind introduced in Section 4.4.2.

Measurement model 

The classical measurement model assumes that the ith fiber measurement
for subject j, $y_{ij}$, differs from true fiber intake $\eta_j$ by a normally
distributed measurement error $\epsilon_{ij}$,

$$y_{ij} = \eta_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \theta),$$

where $\epsilon_{ij}$ is independent of $\eta_j$ (and $\zeta_j$ in the exposure
model). Two measurements were available for a subsample, but information from
subjects only providing one measurement is also used. We will also allow for a
‘drift’ in the fiber measurements by including a dummy variable [Drift] for the
second measurement.

Disease model 

The disease model specifies a logistic regression of heart disease $D_j$ on
true fiber intake,

$$\mathrm{logit}[\Pr(D_j = 1 \mid \eta_j)] = \mathbf{x}_j'\boldsymbol{\beta} + \lambda\eta_j.$$

Here, the factor loading $\lambda$ represents the effect of true exposure on the
log-odds of disease. The same covariates are used in both the disease and
exposure models to allow for both direct and indirect effects of these
covariates on disease. We also consider a model with
$\boldsymbol{\beta} = \mathbf{0}$, thus specifying only indirect effects of the
covariates on heart disease via true fiber intake. Both kinds of models are
shown in Figure 14.1.

Joint model 

Let i = 1, 2 index the fiber measurements and i = 3 the disease response, and
define the corresponding dummy variables $d_{1i}$, $d_{2i}$ and $d_{3i}$. The
joint response model for measurement and disease can then be written in GRC
formulation as

$$\nu_{ij} = (d_{1i} + d_{2i})\,\eta_j + d_{3i}\,[\mathbf{x}_j'\boldsymbol{\beta} + \lambda\eta_j]$$
$$\phantom{\nu_{ij}} = d_{3i}\,\mathbf{x}_j'\boldsymbol{\beta} + \eta_j\,[(d_{1i} + d_{2i}) + \lambda d_{3i}],$$

and the structural model for exposure remains as in (14.1).

Figure 14.1 Path diagram for covariate measurement error models (panel title recovered: Direct and indirect effects)

Note that measurement error is assumed to be nondifferential since it is
conditionally independent of disease status $D_j$ given true exposure $\eta_j$
(no direct path between $y_{ij}$ and $D_j$ in the path diagrams). The joint
model is a generalization of the MIMIC model discussed in Section 3.5, allowing
responses of mixed type (diet measurements are continuous and heart disease is
dichotomous) and direct effects of covariates on the responses.
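
The three submodels can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the analysis reported here: covariates and drift are omitted, the exposure distribution is taken to be normal, and the quadrature estimates from Table 14.1 are used only as plausible parameter values. It checks that simulated measurement error is unrelated to disease status (nondifferential error) and that the correlation between replicate measurements is close to ψ/(ψ + θ).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative values taken from Table 14.1 (quadrature, indirect effects only).
psi, theta = 23.66, 6.95    # Var(zeta_j) and Var(epsilon_ij)
lam, beta0 = -0.13, -2.08   # disease model: loading and constant

# Exposure model (covariates omitted in this sketch).
eta = rng.normal(0.0, np.sqrt(psi), n)

# Measurement model: two replicate fiber measurements per subject.
y = eta[:, None] + rng.normal(0.0, np.sqrt(theta), (n, 2))

# Disease model: logistic regression on true exposure only.
p_disease = 1.0 / (1.0 + np.exp(-(beta0 + lam * eta)))
disease = rng.random(n) < p_disease

# Nondifferential error: the error in the first measurement should be
# uncorrelated with disease status.
error_disease_corr = np.corrcoef(y[:, 0] - eta, disease.astype(float))[0, 1]

# Replicates share only true exposure, so their correlation should be
# close to psi / (psi + theta).
replicate_corr = np.corrcoef(y[:, 0], y[:, 1])[0, 1]

print(round(error_disease_corr, 3), round(replicate_corr, 3))
```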

The parameter estimates, based on a normal and nonparametric exposure
distribution and including and excluding direct effects of the covariates on
heart disease, are given in Table 14.1. Similar estimates of the effect of true
fiber on heart disease are obtained for all four cases considered, with an odds
ratio of about 0.87 per gram of fiber per day. Fiber intake therefore appears
to have a remarkable protective effect on heart disease. However, we have
not adjusted for exercise, an important confounder which both increases total
food intake (including fiber) and reduces the risk of heart disease (Morris et
al., 1977). London Transport staff eat less fiber than bank staff, fiber intake
decreases with age and there is no substantial interaction between occupation
and age. Excluding the direct effect of these variables on heart disease leads
to a very small decrease in the log-likelihood, both for the normal and
nonparametric exposure distributions. Thus the covariates appear to affect the
risk of heart disease only indirectly via fiber intake as in the lower panel of
Figure 14.1.

For the model with indirect effects only, the reliability of the fiber
measurements given the covariates is estimated as 0.77 when exposure is assumed
to have a normal distribution and as 0.80 using NPMLE. The somewhat higher
estimate for NPMLE is consistent with simulations reported in Rabe-Hesketh
et al. (2003a) and Hu et al. (1998).
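
The reliabilities and odds ratios quoted above can be recovered from the estimates in Table 14.1. The sketch assumes that reliability is the ratio ψ/(ψ + θ) of the exposure variance to the total variance given the covariates; the text does not state the formula, but this ratio reproduces the reported values.

```python
import math

# Estimates for the model with indirect effects only (Table 14.1).
estimates = {
    "quadrature": {"psi": 23.66, "theta": 6.95, "lam": -0.13},
    "NPMLE": {"psi": 24.94, "theta": 6.13, "lam": -0.15},
}

summary = {}
for name, est in estimates.items():
    reliability = est["psi"] / (est["psi"] + est["theta"])
    odds_ratio = math.exp(est["lam"])   # per gram of fiber per day
    summary[name] = (round(reliability, 2), round(odds_ratio, 2))

print(summary)
```

The two odds ratios bracket the value of about 0.87 given in the text.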


Table 14.1 Parameter estimates for heart disease data

                          Indirect effects only              Direct and indirect effects
                          Quadrature       NPMLE             Quadrature       NPMLE
                          Est (SE)         Est (SE)          Est (SE)         Est (SE)

Exposure model
γ_1 [Transp]              -1.66 (0.64)     -1.12 (0.44)      -1.68 (0.64)     -1.14 (0.44)
γ_2 [Age]                 -0.21 (0.10)     -0.29 (0.06)      -0.21 (0.10)     -0.28 (0.06)
γ_3 [Age] x [Transp]      0.17 (0.11)      0.22 (0.07)       0.17 (0.11)      0.22 (0.07)
Var(ζ_j)                  23.66 (2.53)     24.94 (-)         23.64 (2.52)     24.98 (-)

Measurement model
α_0 [Const]               17.93 (0.49)     17.58 (0.40)      17.95 (0.49)     17.60 (0.40)
α_1 [Drift]               0.24 (0.42)      0.16 (0.38)       0.23 (0.42)      0.15 (0.38)
θ                         6.95 (1.14)      6.13 (0.85)       6.95 (1.14)      6.13 (0.85)

Disease model
λ                         -0.13 (0.05)     -0.15 (0.06)      -0.13 (0.05)     -0.15 (0.06)
β_0 [Const]               -2.08 (0.21)     -2.07 (0.21)      -1.92 (0.28)     -1.90 (0.28)
β_1 [Transp]              -                -                 -0.26 (0.34)     -0.27 (0.34)
β_2 [Age]                 -                -                 0.04 (0.06)      0.04 (0.06)
β_3 [Age] x [Transp]      -                -                 -0.03 (0.06)     -0.03 (0.07)

Log likelihood            -1373.33         -1320.25          -1372.35         -1319.79


Figure 14.2 NPMLE probability masses and cumulative distribution for NPMLE and
normal distributions

The NPMLE solutions required six masses, which are displayed for the case
of indirect effects only in the upper panel of Figure 14.2 (the distribution for
the model also including direct effects was very similar). As might be expected,
the distribution is positively skewed. Another indication of nonnormality of
the true fiber intake distribution is the considerable increase in log-likelihood
when relaxing the normality assumption. (The NPMLE solutions have nine
extra parameters for a change in log-likelihood of about 53.) The lower panel
of the figure shows the estimated cumulative distribution both for normal and
nonparametric true fiber intake distributions.
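The log-likelihood comparisons quoted for Table 14.1 can be checked with a few lines of arithmetic. The sketch below uses only the four log likelihoods from the table; the dictionary layout and names are illustrative, and the 3 degrees of freedom for the direct effects are an assumption based on the three extra disease-model coefficients.

```python
# Log likelihoods from Table 14.1, keyed by (effects, exposure distribution).
loglik = {
    ("indirect", "normal"): -1373.33,
    ("indirect", "npmle"): -1320.25,
    ("direct+indirect", "normal"): -1372.35,
    ("direct+indirect", "npmle"): -1319.79,
}

# Gain in log likelihood from relaxing normality of true fiber intake.
gain = {
    effects: loglik[(effects, "npmle")] - loglik[(effects, "normal")]
    for effects in ("indirect", "direct+indirect")
}

# Likelihood ratio statistics for adding the three direct effects (assumed 3 df).
lr_direct = {
    dist: 2 * (loglik[("direct+indirect", dist)] - loglik[("indirect", dist)])
    for dist in ("normal", "npmle")
}
CHI2_3DF_5PCT = 7.815  # upper 5% point of the chi-squared distribution with 3 df

for effects, value in gain.items():
    print(f"log-likelihood gain from NPMLE ({effects}): {value:.2f}")
for dist, value in lr_direct.items():
    print(f"LR statistic for direct effects ({dist}): {value:.2f}, "
          f"exceeds 5% critical value: {value > CHI2_3DF_5PCT}")
```

The gain from NPMLE is about 53 in both specifications, whereas the statistics for the direct effects are far below the critical value, in line with the conclusion that the covariates act only indirectly.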

The empirical Bayes predictions of the disturbances $\zeta_j$ in the exposure
model are shown in Figure 14.3 for both NPMLE and normal exposure.
The discrepancy appears to be greater for larger predictions where NPMLE
produces larger values. Simulations reported in Rabe-Hesketh et al. (2003a)
showed that NPMLE predictions are superior to those assuming normality if
the true distribution is highly skewed. In particular, the parametric empirical
Bayes predictions are too severely shrunken when the true values are large.

Figure 14.3 Empirical Bayes predictions of true residual fiber for normal and
nonparametric exposure distributions (horizontal axis: adaptive Gaussian
quadrature predictions).

14.3 Herpes and cervical cancer: A latent class covariate 
measurement error model for a case-control study 

14.3.1 Introduction

Sampling is sometimes conducted stratified on a dichotomous response. Such
retrospective designs are called case-control designs in epidemiology (e.g.
Breslow and Day, 1980; Schlesselman, 1982) and choice-based sampling in
economics (e.g. Manski, 1981). Sampling typically proceeds by including all
‘cases’ (or ill persons in epidemiology) and a random sample of roughly as
many ‘controls’ (healthy persons) as there are cases. Importantly, when the
case prevalence is low, the loss of efficiency compared to a prospective cohort
design is small. An important merit is that more accurate covariate information
may potentially be obtained than in a prospective cohort design since
measurements are only required for a sample of controls.

It has been shown that logistic regression for this retrospective design can be
performed as if the design were prospective (e.g. Farewell, 1979), giving
appropriate estimates except for the intercept. Importantly, Carroll et al. (1995b)
point out that prospective analysis of case-control studies with covariate
measurement error also typically produces consistent estimators and
asymptotically correct standard errors. Specifically, for the case of a categorical
exposure and nondifferential measurement error, Satten and Kupper (1993) and
Carroll et al. (1995b) demonstrate that maximizing the prospective likelihood
produces correct inferences. We adopt this approach in this section, effectively
ignoring the retrospective design in specifying the likelihood.


Table 14.2 Data for case-control study of cervical cancer

                           True exposure   Measured exposure
                  [Case]   [TrueE]         [MeasE]             Count

Validation sample
                    1         0               0                   13
                    1         0               1                    3
                    1         1               0                    5
                    1         1               1                   18
                    0         0               0                   33
                    0         0               1                   11
                    0         1               0                   16
                    0         1               1                   16

Incomplete sample
                    1                         0                  318
                    1                         1                  375
                    0                         0                  701
                    0                         1                  535

Source: Carroll et al. (1993)


Hildesheim et al. (1991) conducted a case-control study to examine a potential
association between exposure to herpes simplex virus type 2 and invasive
cervical cancer. The exposure was measured for cases and controls using
an inaccurate western blot procedure. To investigate misclassification, a gold
standard measurement using a refined western blot procedure was obtained
for a random sub-sample of women from both groups, the ‘validation sample’.
We will treat this measurement as ‘true exposure’ [TrueE] and refer to the
inaccurate western blot as ‘measured exposure’ [MeasE]. The data² have been
analyzed by Carroll et al. (1993) and others and are given in Table 14.2.

We can of course validly estimate the odds ratio for true exposure based on 
just the validation sample. However, this approach would be inefficient since 
the information from the large incomplete sample is discarded. 


14.3.2 Latent class logistic regression

We can estimate the odds ratio for true exposure based on all information 
by specifying three component models as in the previous section: an exposure 
model, a measurement model and a disease model. 


Exposure model 


Let $X_{1j}$ denote true exposure in the validation sample (1 for exposed and 0
for unexposed). We can treat the missing exposure in the incomplete data
sample as a dichotomous latent variable $\eta_j$ taking the values $\eta_j = e_c$,
$c = 1, 2$, with $e_1 = 1$, $e_2 = 0$. The exposure models for the two samples are
simply

$$\text{logit}[\Pr(X_{1j} = 1)] = \varrho_0,$$

and

$$\text{logit}[\pi_1] = \varrho_0,$$

where $\pi_1$ is the probability that a subject in the incomplete data sample is in
latent class 1.


Measurement model 

An assumption often made in covariate measurement error problems, for
instance in Section 14.2.2, is that there is nondifferential measurement error, i.e.
that measured exposure is conditionally independent of disease status given
true exposure. Let $W_{1j}$ and $W_{0j}$ denote measured exposure in the validation
and incomplete data samples, respectively. Also let $D_{1j}$ and $D_{0j}$ denote
disease status (1 for cases and 0 for controls) in the validation sample and
incomplete data samples, respectively. A nondifferential measurement error
model can be specified as

$$\text{logit}[\Pr(W_{1j} \mid X_{1j}, D_{1j})] = \alpha_0 + \alpha_1 X_{1j},$$

for the validation sample and

$$\text{logit}[\Pr(W_{0j} \mid \eta_j, D_{0j})] = \alpha_0 + \alpha_1 \eta_j,$$

for the incomplete data sample.

Carroll et al. (1993) point out that there is evidence for differential
measurement error in the validation sample with cases having a higher estimated
sensitivity (18/23 = 0.78) than controls (16/32 = 0.50). The saturated
differential measurement error model includes the effects of true exposure, disease
status and their interaction,

$$\text{logit}[\Pr(W_{1j} \mid X_{1j}, D_{1j})] = \alpha_0 + \alpha_1 X_{1j} + \alpha_2 D_{1j} + \alpha_3 D_{1j} X_{1j},$$

and

$$\text{logit}[\Pr(W_{0j} \mid \eta_j, D_{0j})] = \alpha_0 + \alpha_1 \eta_j + \alpha_2 D_{0j} + \alpha_3 D_{0j} \eta_j.$$

² The data can be downloaded from gllamm.org/books
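The raw sensitivities quoted above, together with the corresponding specificities, can be read directly off the validation sample in Table 14.2. A minimal sketch follows; the dictionary keys and function names are illustrative choices, not notation from the source.

```python
# Validation-sample counts from Table 14.2,
# keyed by (case, true_exposure, measured_exposure).
validation = {
    (1, 0, 0): 13, (1, 0, 1): 3, (1, 1, 0): 5, (1, 1, 1): 18,
    (0, 0, 0): 33, (0, 0, 1): 11, (0, 1, 0): 16, (0, 1, 1): 16,
}


def sensitivity(case: int) -> float:
    """Pr(measured = 1 | true = 1) among cases (case=1) or controls (case=0)."""
    true_pos = validation[(case, 1, 1)]
    false_neg = validation[(case, 1, 0)]
    return true_pos / (true_pos + false_neg)


def specificity(case: int) -> float:
    """Pr(measured = 0 | true = 0) among cases (case=1) or controls (case=0)."""
    true_neg = validation[(case, 0, 0)]
    false_pos = validation[(case, 0, 1)]
    return true_neg / (true_neg + false_pos)


for case, label in ((1, "cases"), (0, "controls")):
    print(f"{label}: sensitivity {sensitivity(case):.2f}, "
          f"specificity {specificity(case):.2f}")
```

These are descriptive proportions from the validation sample only; the model-based probabilities use the incomplete sample as well and therefore need not coincide with them.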

Disease model 

For the validation sample, we can specify a logistic regression model for disease
given true exposure

$$\text{logit}[\Pr(D_{1j} = 1 \mid X_{1j})] = \beta_0 + \beta_1 X_{1j}.$$

For the remainder of the sample, true exposure status is missing and
represented by latent classes. The model for disease status $D_{0j}$ in the incomplete
data sample then is

$$\text{logit}[\Pr(D_{0j} = 1 \mid \eta_j)] = \beta_0 + \beta_1 \eta_j.$$


Joint model 

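Let $y_{ij}$ denote the three responses, exposure ($i = 1$), measurement ($i = 2$)
and disease ($i = 3$) with dummy variables $d_{ri} = 1$ if $r = i$ and 0 otherwise.
Let $v_j = 1$ if subject $j$ is in the validation sample and 0 otherwise. The joint
model (allowing for differential measurement error) can be written in the GRC
formulation as

$$
\begin{aligned}
\nu_{ij} &= d_{1i}[\varrho_0] + d_{2i}[\alpha_0 + \alpha_1 X_{1j} v_j + \alpha_1 \eta_j (1 - v_j) \\
&\quad + \alpha_2 D_{1j} v_j + \alpha_2 D_{0j}(1 - v_j) + \alpha_3 X_{1j} D_{1j} v_j + \alpha_3 \eta_j D_{0j}(1 - v_j)] \\
&\quad + d_{3i}[\beta_0 + \beta_1 X_{1j} v_j + \beta_1 \eta_j (1 - v_j)] \\
&= \varrho_0 d_{1i} + \alpha_0 d_{2i} + \alpha_1 X_{1j} v_j d_{2i} + \alpha_2 D_{1j} v_j d_{2i} + \alpha_2 D_{0j}(1 - v_j) d_{2i} \\
&\quad + \alpha_3 X_{1j} D_{1j} v_j d_{2i} + \beta_0 d_{3i} + \beta_1 X_{1j} v_j d_{3i} \\
&\quad + \eta_j[\alpha_1 (1 - v_j) d_{2i} + \alpha_3 D_{0j}(1 - v_j) d_{2i} + \beta_1 (1 - v_j) d_{3i}],
\end{aligned}
$$

and

$$\text{logit}[\pi_1] = \varrho_0.$$

Here $\alpha_1$, $\alpha_3$ and $\beta_1$ are factor loadings constrained equal to the
corresponding regression coefficients for the validation sample.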

The parameter estimates for the models assuming differential and
nondifferential measurement error are given in Table 14.3. We can use the
estimates for the measurement model to estimate the conditional probabilities
$\Pr(W_j = 1 \mid X_j, D_j)$, providing information on the sensitivity and specificity of
the measurements, and these are given in Table 14.4.


Table 14.3 Parameter estimates for the models assuming differential and
nondifferential measurement error

                                    Differential       Nondifferential
                                    meas. error        meas. error
Parameter                           Est    (SE)        Est    (SE)

Exposure model
  $\varrho_0$ [Cons]

Measurement model
  $\alpha_0$ [Cons]
  $\alpha_1$ [TrueE]
  $\alpha_2$ [Case]
  $\alpha_3$ [TrueE] x [Case]

Disease model
  $\beta_0$ [Cons]
  $\beta_1$ [TrueE]                 0.608 (0.350)      0.958 (0.237)

Log likelihood

[Only the $\beta_1$ entries, which are also quoted in the text, are legible in the
source; the remaining numerical entries of this table are not recoverable.]


Table 14.4 Conditional probabilities $\Pr(W_j = 1 \mid X_j, D_j)$ of measured exposure

                   $X_j$      $D_j$      Differential    Nondiff.
                   [TrueE]    [Case]     meas. error     meas. error

1-Specificity         0          0          0.31            0.26
                      0          1          0.19            0.26

Sensitivity           1          0          0.58            0.68
                      1          1          0.79            0.68


The likelihood ratio test statistic for comparing the two models is 5.14 for 2
degrees of freedom, so that the simpler model assuming nondifferential
measurement error appears to be adequate. The log odds ratios for true exposure
are estimated as 0.608 (0.350) assuming differential measurement error and
0.958 (0.237) assuming nondifferential measurement error. Using just the
validation sample, the log odds ratio is estimated as 0.681 (0.400). Note that we have
gained very little in terms of precision by analyzing the full sample, probably
because of the low sensitivity and specificity of measured exposure. Using
the inaccurate measurement of exposure instead of the gold standard for the
full sample gives an estimate of 0.453 (0.093). As expected, this estimate is
attenuated compared with the other estimates due to regression dilution. Our
estimates based on a prospective likelihood agree very closely with the
estimates obtained by Carroll et al. (1993) using a retrospective likelihood.
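The two simple estimates quoted above (validation sample only, and measured exposure in the full sample) follow from 2 x 2 tables formed from the counts in Table 14.2. The sketch below uses the standard log odds ratio and its large-sample standard error; the function name and argument layout are illustrative. The p-value line relies on the fact that the chi-squared survival function with 2 degrees of freedom is exp(-x/2).

```python
import math


def log_odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Log odds ratio and large-sample standard error from a 2 x 2 table."""
    estimate = math.log((exp_cases * unexp_controls) / (unexp_cases * exp_controls))
    se = math.sqrt(1 / exp_cases + 1 / unexp_cases
                   + 1 / exp_controls + 1 / unexp_controls)
    return estimate, se


# Validation sample, classified by true exposure [TrueE].
validation = log_odds_ratio(5 + 18, 13 + 3, 16 + 16, 33 + 11)

# Full sample, classified by measured exposure [MeasE].
naive = log_odds_ratio(3 + 18 + 375, 13 + 5 + 318, 11 + 16 + 535, 33 + 16 + 701)

# p-value of the likelihood ratio statistic 5.14 on 2 degrees of freedom.
p_value = math.exp(-5.14 / 2)

print(f"validation sample: {validation[0]:.3f} ({validation[1]:.3f})")
print(f"measured exposure: {naive[0]:.3f} ({naive[1]:.3f})")
print(f"LR test p-value:   {p_value:.3f}")
```

Both figures are on the log odds scale, like the model-based estimates they are compared with.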

For the incomplete data, we can predict true exposure to the herpes virus
using the posterior probabilities of true exposure given measured exposure
and disease status. These posterior probabilities, providing information on
positive predictive values (PPV) and negative predictive values (NPV), are
given in Table 14.5.


Table 14.5 Posterior probabilities $\Pr(X_j = 1 \mid W_j, D_j)$ of true exposure

              $W_j$      $D_j$      Differential    Nondiff.
              [MeasE]    [Case]     meas. error     meas. error

1-NPV            0          0          0.33            0.24
                 0          1          0.28            0.45

PPV              1          0          0.59            0.65
                 1          1          0.86            0.83
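The posterior probabilities in Table 14.5 combine the measurement probabilities of Table 14.4 with the exposure prevalence in each disease group through Bayes' rule. The sketch below illustrates this for the nondifferential model. The exposure prevalence among controls (0.415) is an assumption: it is not reported in the text and was back-solved here so that the control rows are reproduced; the prevalence among cases then follows from the estimated log odds ratio of 0.958. All function names are illustrative.

```python
import math


def posterior_exposed(w: int, prevalence: float, sens: float, spec: float) -> float:
    """Pr(X = 1 | W = w) by Bayes' rule."""
    lik_exposed = sens if w == 1 else 1 - sens        # Pr(W = w | X = 1)
    lik_unexposed = (1 - spec) if w == 1 else spec    # Pr(W = w | X = 0)
    numerator = prevalence * lik_exposed
    return numerator / (numerator + (1 - prevalence) * lik_unexposed)


def prevalence_in_cases(prevalence_controls: float, log_or: float) -> float:
    """Exposure prevalence among cases implied by the exposure odds ratio."""
    odds = prevalence_controls / (1 - prevalence_controls) * math.exp(log_or)
    return odds / (1 + odds)


sens, spec = 0.68, 1 - 0.26   # nondifferential column of Table 14.4
p_controls = 0.415            # ASSUMPTION: back-solved, not given in the source
p_cases = prevalence_in_cases(p_controls, log_or=0.958)

table = {
    (w, d): posterior_exposed(w, p_cases if d == 1 else p_controls, sens, spec)
    for w in (0, 1) for d in (0, 1)
}
for (w, d), prob in sorted(table.items()):
    print(f"W={w}, D={d}: Pr(X=1 | W, D) = {prob:.2f}")
```

With these inputs the sketch agrees with the nondifferential column of Table 14.5 to within rounding of the published two-decimal inputs.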


14.4 Job training and depression: A complier average causal effect
model

14.4.1 Introduction

Little and Yau (1998) analyzed data from the JOBS II intervention trial
described in Vinokur et al. (1995)³. Unemployed individuals who had lost their
jobs within the last 13 weeks and were looking for a job were randomized to
receive either five half-day sessions of job training plus a booklet briefly
describing search methods and tips (treatment group) or just the booklet
(control group). The aim of the intervention was to prevent poor mental health
and promote high-quality re-employment. Noncompliance was a problem in
the intervention group with 46% never attending any of the seminars.

14.4.2 The compliance problem

Consider a randomized study with two treatment arms. For simplicity, we will 
refer to these as the treatment (or active treatment) group and the control 
group, although the control treatment need not be a placebo. 

Imbens and Rubin (1997ab) classified study participants into four types
of compliance status: compliers (adhere to their assigned treatment),
always-takers (take active treatment regardless of assignment), never-takers (never
take active treatment regardless of assignment) or defiers (take opposite
treatment to assigned).

³ These data can be downloaded from gllamm.org/books or from the ICPSR at the
Institute for Social Research of the University of Michigan at
http://www.icpsr.umich.edu/access/index.html (under JOBS #2739).

Importantly, compliance status cannot be observed in a parallel trial since
whether or not a subject takes the treatment can only be observed in the
assigned treatment group (and not the other group). Consider first a subject
assigned to the active treatment. If the subject takes the active treatment
he may either be a complier or an always-taker. If he fails to take it he may
either be a never-taker or a defier. Consider then a subject assigned to the
control treatment. If the subject takes the control treatment he may either be
a complier or a never-taker. If he fails to take it he may either be an
always-taker or a defier. It might be helpful to have a look at the graphical summary
of this setup provided in Figure 14.4.

It often seems reasonable to assume that there are no defiers, sometimes
called the ‘monotonicity’ assumption. This is because participation in a trial
is generally voluntary and nonadherence to the assigned treatment is usually
due to prior preference for one of the treatments. Thus, we will henceforth
assume that there are no defiers. As has been demonstrated by Imbens and
Rubin (1997b), this assumption is useful for identification. Specifically, if the
probabilities of actually taking the assigned treatments in both groups can be
estimated, the probabilities of being an always-taker, never-taker and complier
become identified.

This can be seen as follows (consulting Figure 14.4 may once more be
helpful): In the treatment group, the probability of being a never-taker is equal
to the probability of not taking the treatment since there are no defiers. Due
to randomization, the probability of being a never-taker is the same in the
control group. In the control group, the probability of being a complier can
then be obtained by subtracting the probability of being a never-taker from
the probability of taking the control treatment. Due to randomization, the
probability of being a complier is the same in the treatment group. In the
treatment group, the probability of being an always-taker can now be
obtained by subtracting the probability of being a complier from the probability
of taking the treatment. Finally, randomization once more ensures that the
probability of being an always-taker is the same in the control group.
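The identification argument above amounts to three subtractions. The sketch below follows the steps literally under the no-defiers assumption; the function name, argument names and the adherence rates (0.70 and 0.85) are hypothetical and chosen only for illustration.

```python
def compliance_type_probabilities(p_active_in_treatment_arm: float,
                                  p_control_in_control_arm: float) -> dict:
    """Probabilities of the three compliance types when there are no defiers.

    p_active_in_treatment_arm: Pr(takes active treatment | assigned to treatment)
    p_control_in_control_arm:  Pr(takes control treatment | assigned to control)
    """
    # Treatment arm: everyone not taking the treatment is a never-taker.
    never = 1.0 - p_active_in_treatment_arm
    # Control arm: those taking the control treatment are compliers or never-takers.
    complier = p_control_in_control_arm - never
    # Treatment arm: those taking the treatment are compliers or always-takers.
    always = p_active_in_treatment_arm - complier
    probs = {"never-taker": never, "complier": complier, "always-taker": always}
    if min(probs.values()) < -1e-12:
        raise ValueError("inputs are inconsistent with the no-defiers assumption")
    return probs


# Hypothetical adherence rates, for illustration only.
probs = compliance_type_probabilities(0.70, 0.85)
for name, value in probs.items():
    print(f"{name}: {value:.2f}")
```

The three probabilities always sum to one, since the always-taker probability reduces to one minus the adherence rate in the control arm.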

Conventionally, three different kinds of analysis have been conducted when
there is noncompliance:

1. An ‘as-randomized analysis’ or intention to treat (ITT) analysis compares
the outcomes of participants by assigned group as shown in Figure 14.4.
The ITT effect is the effect of treatment assignment rather than the effect
of treatment taken (often called ‘effectiveness’ as opposed to ‘efficacy’).
Importantly, the standard ITT estimator is protected from selection bias
by randomized treatment assignment.

2. An ‘as-treated analysis’ compares the outcomes of participants by
treatment actually taken. Here, randomization is violated and causal
interpretations of treatment effects may be dubious.

3. A ‘per-protocol analysis’ compares subjects who adhered to the assigned
treatments, excluding those not adhering to assigned treatment, as shown in
Figure 14.4. This does not correspond to any useful summary of individual
causal effects.


[Figure 14.4: diagram labeled ‘Intention To Treat’ (top) and ‘Per Protocol’
(bottom), showing the compliance types consistent with each observed cell.]

Randomized to Treatment
  Treatment taken:       always-takers or compliers
  Treatment not taken:   never-takers or defiers

Randomized to Control
  Treatment taken:       always-takers or defiers
  Treatment not taken:   never-takers or compliers

Figure 14.4 Compliance status and different ways of handling noncompliance

An alternative treatment effect was introduced by Imbens and Rubin (1997b).
Their ‘complier average causal effect’ (CACE) is the treatment effect among
true compliers: the mean difference in outcome between compliers in the
treatment group and those controls who would have complied with treatment had
they been randomized to the treatment group. Thus, the CACE may be
viewed as a measure of ‘efficacy’ as opposed to ‘effectiveness’. The crux of
the CACE formulation is the distinction between a ‘true complier’ (complier
under both treatments) and an ‘observed complier’ (taking the assigned
treatment). Hence, the challenge in modeling CACE is that the true compliance
status of subjects is generally unknown. Formally, the complier average causal
effect (denoted $\delta_c$) is defined as

$$\delta_c = \mu_{1c} - \mu_{0c},$$

where $\mu_{1c}$ is the mean outcome of compliers in the treatment group and $\mu_{0c}$
the mean outcome of compliers in the control group.

In the JOBS II trial, the job training treatment was not available to those
assigned to the control group. It is nevertheless plausible that there are
always-takers who would take part in training if allowed regardless of assigned
treatment. However, we cannot estimate the probability of being an always-taker
since always-takers in the control group cannot act as always-takers and
always-takers in the treatment group are indistinguishable from compliers.
One approach is to combine always-takers and compliers into a single group
(e.g. Little and Yau, 1998) and another is to assume that there are no
always-takers (e.g. Jo, 2002). We prefer the former approach and will loosely refer
to the combined group of compliers and always-takers as ‘compliers’. The
resulting ‘complier’ average causal effect is still meaningful since it represents
the effect of training on those who will participate given the opportunity. The
design for the JOBS II trial and the meaning of the CACE in this setting are
shown in Figure 14.5, where the broken line indicates that we do not know who
would take the treatment if offered and who wouldn’t in the control group.

Figure 14.5 Complier Average Causal Effect (CACE) in job training trial


We also assume that never-takers have the same mean in the treatment and
control groups,

$$\mu_{0n} = \mu_{1n}$$

(often called an ‘exclusion restriction’).


14.4.3 CACE modeling

Models for CACE are discussed in Angrist et al. (1996), Imbens and Rubin
(1997ab), Little and Yau (1998) and Jo (2002). Formulation of this model as
a latent class model with training data is described in Muthén (2002) and we
use this approach here.

Compliance model

Let $c_j$ be a dummy variable for compliers (or always-takers) versus
never-takers and $r_j$ a dummy for being randomized to the treatment versus control
group.

Compliance status is not observable in the control group and is thus
represented by $\eta_j$, a discrete latent variable with two values, $e_1 = 1$, $e_2 = 0$.
Because of the randomization, we can set the parameters in the model for
latent compliance in the control group equal to those for observed compliance
in the treatment group,

$$\text{logit}[\pi_{1j}] = \text{logit}[\Pr(c_j = 1 \mid r_j = 0)] = \mathbf{w}_j'\boldsymbol{\varrho} = \text{logit}[\Pr(c_j = 1 \mid r_j = 1)].$$

The covariates for the compliance model are

• [Age] age in years

• [Motivate] motivation to attend

• [Educ] school grade completed

• [Assert] assertiveness

• [Single] dummy for being single

• [Econ] economic hardship

• [Nonwhite] dummy variable for not being white versus white

Depression model

The outcome considered is the change in depression score (11-item subscale of
Hopkins Symptom Checklist) from baseline to six months after the training
seminars.

If compliance were known in the control group, we could model depression
as

$$y_j = \beta_0 + \beta_1 c_j (1 - r_j) + \beta_2 c_j r_j + \epsilon_j,$$

where $\epsilon_j \sim N(0, \theta)$, so that $\mu_{0n} = \mu_{1n} = \beta_0$, $\mu_{0c} = \beta_0 + \beta_1$,
$\mu_{1c} = \beta_0 + \beta_2$ and $\delta_c = \beta_2 - \beta_1$. However, $c_j$ in the second term
(control group) is never observed. We therefore write the model in terms of
latent compliance $\eta_j$ as

$$y_j = \beta_0 + \beta_1 \eta_j (1 - r_j) + \beta_2 c_j r_j + \epsilon_j.$$

We can add covariates $\mathbf{x}_j$ with constant effects $\boldsymbol{\alpha}$ across treatment groups
by specifying

$$y_j = \beta_0 + \mathbf{x}_j'\boldsymbol{\alpha} + \beta_1 \eta_j (1 - r_j) + \beta_2 c_j r_j + \epsilon_j.$$

The covariates considered for the depression model are

• [Basedep] baseline depression score

• [Risk] baseline risk score; an index based on depression, financial strain and
assertiveness

Joint model

In CACE modeling we have two responses, compliance $c_j$ for the treatment
group and depression $y_j$ for the entire sample. We denote these responses
$y_{1j}$ and $y_{2j}$, respectively, and define corresponding dummy variables $d_{ri} = 1$ if
$r = i$ and zero otherwise. The response model can then be written in the GRC
formulation as

$$\nu_{ij} = d_{1i}\mathbf{w}_j'\boldsymbol{\varrho} + \beta_0 d_{2i} + d_{2i}\mathbf{x}_j'\boldsymbol{\alpha} + \eta_j \beta_1 (1 - r_j) d_{2i} + \beta_2 c_j r_j d_{2i},$$

where a logit link and Bernoulli distribution are specified when $i = 1$ whereas
an identity link and normal density are used when $i = 2$. The structural model
is

$$\text{logit}[\pi_{1j}] = \mathbf{w}_j'\boldsymbol{\varrho}.$$

Here we will replicate the analysis of the ‘high risk’ group presented in
Little and Yau (1998). This group consisted of 335 subjects randomized to job
training and 167 subjects randomized to the control group. Only 183 (55%)
out of the 335 subjects randomized to job training actually participated.
Parameter estimates and standard errors for the CACE model with and without
covariates in the compliance model are given in Table 14.6.

Increasing age, motivation and education are associated with higher
probability of compliance while assertiveness reduces the probability. There seems
to be a greater reduction in depression amongst those attending job training.
The complier average causal effect is estimated as -0.14 (0.14) when there
are no covariates for compliance. A more pronounced complier average causal
effect of -0.31 (0.12) is obtained for the model with covariates. For
comparison, the conventional estimates of treatment effect, using baseline depression
and risk as covariates for depression as in the CACE model, are

1. Intention to treat: -0.15 (0.07)

2. As treated (combining nonparticipants with control group): -0.18 (0.07)

3. Per protocol (excluding nonparticipants): -0.21 (0.08)

As expected, the intention to treat effect is the smallest since inclusion of
nonparticipants in the treatment group dilutes the treatment effect. It is
interesting to compare the CACE and per protocol effects since both exclude
never-takers from the treatment group, whereas only CACE also excludes
them from the control group. If never-takers had a smaller treatment effect,
the per protocol effect would be higher, but somewhat surprisingly this is not
the case ($\beta_1$ is positive).
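The relation between the per protocol, intention to treat and complier average causal effects can be made explicit for the depression model without covariates. The following is a sketch under stated assumptions: $\pi_c$ (a symbol introduced here, not in the text) denotes the proportion of compliers, which is the same in both arms by randomization, and every subject in the control group counts as adhering because training was not available to them.

```latex
\begin{align*}
\text{treatment arm, participants:} \quad & \mu_{1c} = \beta_0 + \beta_2 \\
\text{control arm, all subjects:} \quad & \pi_c \mu_{0c} + (1 - \pi_c)\mu_{0n} = \beta_0 + \pi_c \beta_1 \\
\delta_{\mathrm{PP}} &= \beta_2 - \pi_c \beta_1 = \delta_c + (1 - \pi_c)\beta_1 \\
\delta_{\mathrm{ITT}} &= \pi_c(\beta_2 - \beta_1) = \pi_c \delta_c
\end{align*}
```

With $\beta_1 > 0$ the per protocol effect lies above $\delta_c$, that is, it is less negative, in line with the estimates -0.21 and -0.31 reported above. The reported conventional estimates also adjust for covariates, so these identities can only be expected to hold approximately for them.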

Table 14.6 Parameter estimates for CACE model with and without covariates for
compliance model

                                        No covariates      Covariates
Parameter                               Est    (SE)        Est    (SE)

Compliance model
  $\varrho_0$ [Cons]
  $\varrho_1$ [Age]
  $\varrho_2$ [Motivate]
  $\varrho_3$ [Educ]
  $\varrho_4$ [Assert]
  $\varrho_5$ [Single]
  $\varrho_6$ [Econ]
  $\varrho_7$ [Nonwhite]

Depression model
  $\beta_0$ [Cons]
  $\beta_1$
  $\beta_2$
  $\delta_c = \beta_2 - \beta_1$        -0.14 (0.14)       -0.31 (0.12)
  $\alpha_1$ [Basedep]
  $\alpha_2$ [Risk]
  $\theta$

Log likelihood

[Only the $\delta_c$ entries, which are also quoted in the text, are legible in the
source; the remaining numerical entries of this table are not recoverable.]


14.5 Physician advice and drinking: An endogenous treatment
model

14.5.1 Introduction

Alcohol abuse is a significant public health concern, leading not only to health
problems but also to, for instance, alcohol-related traffic accidents. One
approach to reducing alcohol-related problems is through physicians advising
problem drinkers to reduce their consumption. The efficacy of physician
advice in reducing drinking has been demonstrated in controlled clinical trials.
However, studies of the effect of physician advice based on observational data
are required, since efficacy does not necessarily translate into effectiveness in
everyday practice.

Kenkel and Terza (2001) analyzed data from the 1990 National Health
Interview Survey core questionnaire and special supplements. The data comprise
a sub-sample of 2467 males who are current drinkers and have been told that
they have hypertension⁴. The response variable is the number of alcoholic
beverages consumed in the last two weeks, i.e. a count variable. Twenty-eight
percent of the drinkers report having been advised to reduce drinking. The
objective is to estimate the ‘treatment effect’ of physician advice.

In contrast to randomized studies, a major problem in estimating treatment
effects from observational studies is that the treatment is often ‘endogenous’,
in the sense that the treatment is correlated with unobserved heterogeneity.
For instance, subjects with a poor prognosis (unobserved heterogeneity) may
self-select into the treatment perceived to be the best. It is hardly surprising
that conventional modeling in this case can produce severely biased estimates
of the treatment effect. Hence, models attempting to correct for selection bias,
often called endogenous treatment models, have been suggested in
econometrics (e.g. Heckman, 1978). In this section we apply a class of endogenous
treatment models for counts as outcome discussed by Terza and colleagues
(e.g. Terza, 1998; Kenkel and Terza, 2001).

⁴ These data can be downloaded from the data archive of Journal of Applied
Econometrics at http://qed.econ.queensu.ca/jae/2001-v16.2/kenkel-terza/.

14.5.2 Drinking model 

The main objective is to study the effect of

• [Advice] answer to the question ‘Have you ever been told by a physician to
drink less?’, treated as a dummy variable $T_j$ for patient $j$.

In addition, covariates included in $\mathbf{x}_j$ for the drinking process are:

• [HiEduc] dummy for high education (> 12 years vs. ≤ 12 years)

• [Black] dummy for race (black vs. nonblack).

For simplicity, this is a subset of the covariates used in the analyses reported
by Kenkel and Terza (2001).

Since the number of beverages consumed $y_j$ is a count, a natural approach
is to consider a Poisson regression model

$$\Pr(y_j; \mu_j) = \frac{\mu_j^{y_j} \exp(-\mu_j)}{y_j!},$$

where the expectation $\mu_j$ is structured as a log-linear model

$$\log(\mu_j) = \alpha T_j + \mathbf{x}_j'\boldsymbol{\beta}.$$

Note that the treatment effect of [Advice] is represented by $\alpha$ in this model.
Estimates for the model are given in the ‘Poisson’ column of Table 14.7.
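To make the log-linear structure concrete, the sketch below evaluates the Poisson probability function and the fitted mean for a given covariate pattern, using the conventional Poisson estimates from Table 14.7 as default values. The function names and argument layout are illustrative only.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """Pr(y; mu) = mu^y exp(-mu) / y! for a count y = 0, 1, 2, ..."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def expected_drinks(advice: int, hi_educ: int, black: int,
                    alpha: float = 0.47,
                    beta: tuple = (2.65, -0.18, -0.31)) -> float:
    """Mean from log(mu) = alpha*T + beta0 + beta1*HiEduc + beta2*Black.

    Defaults are the conventional Poisson estimates in Table 14.7.
    """
    beta0, beta1, beta2 = beta
    return math.exp(alpha * advice + beta0 + beta1 * hi_educ + beta2 * black)


mu_no_advice = expected_drinks(advice=0, hi_educ=0, black=0)
mu_advice = expected_drinks(advice=1, hi_educ=0, black=0)
rate_ratio = mu_advice / mu_no_advice   # equals exp(alpha)

print(f"expected drinks without advice: {mu_no_advice:.1f}")
print(f"rate ratio for advice:          {rate_ratio:.2f}")
print(f"Pr(y = 0) without advice:       {poisson_pmf(0, mu_no_advice):.2e}")
```

Because the model is log-linear, the coefficient of [Advice] acts multiplicatively on the expected count, so exp(alpha) is the rate ratio comparing advised with unadvised patients who share the same covariates.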

Since a zero response occurred for as many as 21 percent of the patients, we
might consider models inducing overdispersion, inflating the number of zeros
as compared to Poisson regression. A zero-inflated Poisson (ZIP) model of
the kind discussed in Section 11.2 does not seem appropriate for the present
application since abstainers are excluded from the sample. Thus, we consider
a random intercept Poisson regression to model overdispersion,

$$\log(\mu_j) = \alpha T_j + \mathbf{x}_j'\boldsymbol{\beta} + \zeta_j,$$

where $\zeta_j \sim N(0, \psi)$. Estimates for this model are presented in the
‘Overdisp. Poisson’ column of Table 14.7.

For both Poisson models [HiEduc] and [Black] have negative effects on
drinking, which does not seem unreasonable. The important point to note, however,
is that [Advice] has a rather perverse positive effect on drinking! Moreover, the
sign does not change when numerous other potential confounders are included
(see Kenkel and Terza, 2001). Physician advice regarding problem drinking
should thus be abandoned if we were to take these results seriously. Note
that there is considerable overdispersion with an estimated random intercept
variance of 2.90.


Table 14.7 Estimates and standard errors for drinking and advice models

                                              Overdisp.                          Endog.
                           Poisson            Poisson           Probit           Treatment
Parameter                  Est    (SE)        Est    (SE)       Est    (SE)      Est    (SE)

Fixed part
 Drinking model
  $\alpha$ [Advice]         0.47 (0.01)        0.59 (0.08)                       -2.42 (0.23)
  $\beta_0$ [Cons]          2.65 (0.01)        1.43 (0.06)                        2.32 (0.09)
  $\beta_1$ [HiEduc]       -0.18 (0.01)        0.02 (0.07)                       -0.29 (0.10)
  $\beta_2$ [Black]        -0.31 (0.02)       -0.29 (0.11)                        0.20 (0.11)

 Advice model
  $\gamma_0$ [Cons]                                            -0.48 (0.08)      -1.13 (0.16)
  $\gamma_1$ [HiEduc]                                          -0.25 (0.06)      -0.40 (0.10)
  $\gamma_2$ [Black]                                            0.30 (0.08)       0.60 (0.15)
  $\gamma_3$ [HlthIns]                                         -0.27 (0.07)      -0.33 (0.10)
  $\gamma_4$ [RegMed]                                           0.18 (0.07)       0.39 (0.10)
  $\gamma_5$ [Heart]                                            0.17 (0.08)       0.51 (0.11)

Random part
  Variance $\psi$                              2.90 (0.11)                        2.50 (0.69)
  Loading $\lambda$                                                               1.43 (0.15)

Log likelihood           -32939.15          -8857.85          -1419.90         -10254.02


14.5.3 Advice model

In order to gain some insight into the advice process, we specify and estimate
a probit model for the treatment [Advice]. It is useful to formulate the model
as a latent response model

$$T_j^* = \mathbf{w}_j'\boldsymbol{\gamma} + \epsilon_j,$$

where $\epsilon_j \sim N(0, 1)$. The treatment $T_j$ is generated from the threshold model

$$T_j = \begin{cases} 1 & \text{if } T_j^* > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The covariate vector $\mathbf{w}_j$ contains three dichotomous health service utilization
variables in addition to the covariates already introduced:

• [HlthIns] dummy for health insurance

• [RegMed] dummy for registered source of medical care

• [Heart] dummy for heart condition.

Once more, we have for simplicity included a subset of the covariates used in
Kenkel and Terza (2001).

Estimates for this model are presented in the ‘Probit’ column of Table 14.7,
where we see that [Black], [RegMed] and [Heart] have positive effects on
[Advice], whereas [HiEduc] and [HlthIns] have negative effects.


14.5.4 Joint model for drinking and advice: advice as endogenous treatment

It is likely that there could be shared unobserved heterogeneity for the
drinking and advice processes, since physicians may be more likely to give advice
to patients considered at risk. Hence, we consider a simultaneous model for
drinking and advice, a so-called endogenous treatment model

$$\log(\mu_j) = \alpha T_j + \mathbf{x}_j'\boldsymbol{\beta} + \lambda \zeta_j,$$

$$T_j^* = \mathbf{w}_j'\boldsymbol{\gamma} + \zeta_j + \epsilon_j,$$

where $\zeta_j \sim N(0, \psi)$ is a factor representing shared unobserved heterogeneity,
$\lambda$ is a factor loading and $\epsilon_j \sim N(0, 1)$. Note that this factor model is simply
a reparametrization of the model considered by Terza (1998), reducing the
dimension of integration from 2 to 1.

It follows from the endogenous treatment model that

$$\text{Var}[\log(\mu_j) \mid T_j, \mathbf{x}_j] = \lambda^2 \psi > 0,$$

thus allowing for overdispersion,

$$\text{Var}[T_j^* \mid \mathbf{w}_j] = \psi + 1,$$

and

$$\text{Cov}[\log(\mu_j), T_j^* \mid \mathbf{w}_j] = \lambda \psi.$$

Note that due to the increased residual variance in the probit model, the
coefficients $\boldsymbol{\gamma}$ are expected to increase by a factor of about $\sqrt{\psi + 1}$.
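The implied moments and the approximate rescaling of the probit coefficients can be checked against Table 14.7. The sketch below plugs in the estimates of $\psi$ and $\lambda$ from the ‘Endog. Treatment’ column and compares the rescaled ‘Probit’ coefficients with the jointly estimated ones; the dictionary names are illustrative, and the comparison is only expected to hold roughly because the factor is approximate.

```python
import math

psi, lam = 2.50, 1.43            # 'Endog. Treatment' estimates in Table 14.7

var_log_mu = lam ** 2 * psi      # Var[log(mu_j) | T_j, x_j]
var_t_star = psi + 1.0           # Var[T*_j | w_j]
cov = lam * psi                  # Cov[log(mu_j), T*_j | w_j]
corr = cov / math.sqrt(var_log_mu * var_t_star)
scale = math.sqrt(var_t_star)    # expected inflation of the probit coefficients

probit = {"Cons": -0.48, "HiEduc": -0.25, "Black": 0.30,
          "HlthIns": -0.27, "RegMed": 0.18, "Heart": 0.17}
joint = {"Cons": -1.13, "HiEduc": -0.40, "Black": 0.60,
         "HlthIns": -0.33, "RegMed": 0.39, "Heart": 0.51}

print(f"scale factor sqrt(psi + 1) = {scale:.2f}, implied correlation = {corr:.2f}")
for name in probit:
    print(f"{name:8s} probit x scale = {probit[name] * scale:6.2f}   "
          f"joint model = {joint[name]:6.2f}")
```

For positive $\lambda$ the implied correlation reduces to $\sqrt{\psi/(\psi + 1)}$ and does not depend on the size of the loading.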

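Viewing the responses $i = 1$ for drinking and $i = 2$ for advice as clustered
within patients $j$, we define the dummy variables $d_{1i} = 1$ for drinking and
$d_{2i} = 1$ for advice. Using the GRC notation, the endogenous treatment model
can be written as

$$
\begin{aligned}
\nu_{ij} &= d_{1i}[\alpha T_j + \mathbf{x}_j'\boldsymbol{\beta} + \lambda \zeta_j] + d_{2i}[\mathbf{w}_j'\boldsymbol{\gamma} + \zeta_j] \\
&= \alpha d_{1i} T_j + d_{1i}\mathbf{x}_j'\boldsymbol{\beta} + d_{2i}\mathbf{w}_j'\boldsymbol{\gamma} + \zeta_j[d_{1i}\lambda + d_{2i}],
\end{aligned}
$$

with log link and Poisson distribution for $i = 1$ and probit link for $i = 2$. A
path diagram of this model is shown in Figure 14.6, where $\mathbf{w}_j' = (\mathbf{x}_j', \mathbf{z}_j')$.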

An important feature of the advice model is that $T_j$ and $\zeta_j$ are dependent,
which implies that $T_j$ and $\zeta_j$ become dependent in the drinking model.
[Advice] is thus ‘endogenous’ in the drinking model and valid inference regarding
the treatment effect $\alpha$ must in general be based on the simultaneous model
including both drinking and advice as response or ‘endogenous’ variables. In
contrast, the restriction $\lambda = 0$ decomposes the likelihood into a component
for the conventional Poisson model for drinking and a probit component for
advice (equal to that considered above if $\psi = 0$).

Figure 14.6 Path diagram for endogenous treatment model

Importantly, randomization of patients to [Advice] would render the
treatment exogenous, since $T_j$ would become independent of the unobserved
heterogeneity $\zeta_j$. In this case [Advice] becomes ‘exogenous’ and valid inference
regarding $\alpha$ can be obtained from the conventional Poisson regression model.
Observe that $\lambda$ governs the sign of the covariance and that a shared random
intercept model (for drinking and advice) is obtained if $\lambda = 1$.

The switching regime model is a generalization of the endogenous
treatment model. Here, a probit model governs whether a patient is allocated to
treatment or nontreatment,

$$T_j^* = \mathbf{w}_j'\boldsymbol{\gamma} + \zeta_{1j} + \zeta_{2j} + \epsilon_j,$$

where $\zeta_{1j} \sim N(0, \psi_{11})$ and $\zeta_{2j} \sim N(0, \psi_{22})$, independently distributed, and
independent from $\epsilon_j$. In the treatment regime drinking is modeled as

$$\log(\mu_{2j}) = \beta_{20} + \mathbf{x}_j'\boldsymbol{\beta}_2 + \lambda_2 \zeta_{2j},$$

and in the nontreatment regime as

$$\log(\mu_{1j}) = \beta_{10} + \mathbf{x}_j'\boldsymbol{\beta}_1 + \lambda_1 \zeta_{1j},$$

where $\mathbf{x}_j$ no longer includes a constant. The treatment effect now becomes
$\alpha = \beta_{20} - \beta_{10}$. Note that a patient is only allocated to one of the regimes and
never both; the responses in the regimes thus represent potential outcomes.
Unfortunately, attempts to obtain reliable estimates for the switching regimes
model failed for this application.

Estimates for the endogenous treatment model are presented in the fourth
column of Table 14.7. The main point to note is that the effect of physician
advice is now reversed. Receiving advice leads to reduced drinking, consistent
with the results from controlled clinical trials. Introducing further covariates
(see Kenkel and Terza, 2001) does not alter this conclusion. Since $\lambda$ is
positive, there appears to be a positive correlation between the unobserved
heterogeneity for the drinking and advice processes. Thus, conditional on the
covariates (the observed heterogeneity) the patients most prone to be heavy
drinkers are also most likely to receive drinking advice. As expected, including
unobserved heterogeneity furthermore increases the magnitude of the estimates in
the advice model.

Note that the endogenous treatment model imposes so-called exclusion
restrictions (different from the exclusion restriction in CACE) since the effects
of the health service utilization variables are implicitly set to zero in the
drinking part of the model. According to Kenkel and Terza (2001), the restrictions
can be motivated from standard models for demand for alcohol as a consumer
good. The restrictions aid identification but are not necessary for it.
Relaxing the restrictions did not alter the main results appreciably. When
relaxing such restrictions does change the results appreciably, the results
should be interpreted with extreme caution.

It is useful to note that a sample selection model for counts is obtained
as the special case of the endogenous treatment model where $\alpha = 0$ and $T_j$
plays the role of sample selection indicator, taking the value 1 if subject $j$
is included in the sample and 0 otherwise. This model is a generalization of
the sample selection model originally suggested by Heckman (e.g. Heckman,
1979) for continuous responses.

14.6 Treatment of liver cirrhosis: A joint survival and marker model

14.6.1 Introduction

Andersen et al. (1993) analyzed a randomized controlled trial with Danish
patients suffering from histologically verified liver cirrhosis (severely damaged
liver, often caused by alcohol abuse). The patients were randomized to either
treatment with the hormone prednisone or to placebo. 488 patients were
considered for whom the initial biopsy could be reevaluated using more restrictive
criteria, of whom 251 received prednisone and 237 placebo (see footnotes 5 and
6). Treatment with prednisone is denoted [Treat] and the corresponding dummy
variable as $T$ in the sequel.

The main purpose of the trial was to ascertain whether prednisone reduces
the death hazard for cirrhosis patients. Patients were considered censored
if lost to follow-up or alive at the end of the observation period. Repeated
measurements of prothrombin, a biochemical marker of liver functioning, were
also obtained. A prothrombin index was based on a blood test of coagulation
factors II, VII and X produced by the liver (we divide the original index by 10
to avoid very small regression coefficients). We also consider a dichotomized
version of the index, defining a value as normal if the original index is higher
than 70 and otherwise as abnormal.

5 We thank Per Kragh Andersen for providing us with these data.

6 The data can be downloaded from gllamm.org/books

The measurements of prothrombin were scheduled to take place 3, 6 and 12
months after randomization and thereafter once a year but the actual follow-up
times were irregular. Thus, one problem is a highly unbalanced design and
missing measurements for the marker.

Another problem, specific to the survival setting, is that values of the
time-varying covariate are required at each failure time for each subject surviving
beyond this time (see Display 2.1 on page 43 for risk-set expansion of data). 
However, measurements of the marker are not available at each failure time 
so some kind of interpolation method must be used. Christensen et al. (1986) 
made the somewhat unrealistic assumption that time-varying covariates were 
constant between follow-up occasions. Instead, we propose using a growth 
curve model for the marker process. 

It also seems plausible that liver functioning could be an intervening
variable, implying that prednisone could have an indirect effect on the death
hazard via liver functioning in addition to a direct treatment effect.
Disentangling these effects may first of all improve our understanding of how the
treatment works. Furthermore, it may shed light on whether the marker can
be regarded as a ‘surrogate’ for survival. If there is no direct effect, survival is
conditionally independent of treatment given the marker. Thus, survival
information conveys no additional information on the treatment effect once the
marker is known and the marker is called a ‘perfect surrogate’ (e.g. Prentice,
1989). The marker may then be used as response variable instead of survival,
which can be beneficial for various practical reasons, for instance by reducing
the required follow-up time.

To address these questions, we specify a structural equation survival model, 
composed of a latent marker model and a hazard model with a latent covariate. 


14.6.2 Latent marker model

Let $t_{ij}$ be the time of the $i$th measurement occasion for patient $j$. The
observed marker at $t_{ij}$ is denoted $y_{ij}$ and is related to the latent or
‘true’ marker $\eta^{(2)}_{ij}$ via the measurement model

$$y_{ij} = \beta_0 + \eta^{(2)}_{ij} + \epsilon_{ij},$$

where $\epsilon_{ij} \sim \mathrm{N}(0, \theta)$.

A structural model for the latent marker is specified as

$$\eta^{(2)}_{ij} = \gamma_1 t_{ij} + \gamma_2 T_j + \eta^{(3)}_j, \tag{14.2}$$

where $\eta^{(3)}_j \sim \mathrm{N}(0, \psi)$ is independently distributed from
$\epsilon_{ij}$. Note that there is no disturbance in the structural model or
equivalently $\mathrm{Var}(\zeta^{(2)}_{ij}) = 0$. We also considered a more general
structural model, including a random slope for time in addition to the random
intercept (bivariate normal random intercept and slope), but the log-likelihood
did not increase appreciably.

Substituting for the latent marker in the measurement model, the reduced
form marker model becomes

$$y_{ij} = \beta_0 + \gamma_1 t_{ij} + \gamma_2 T_j + \eta^{(3)}_j + \epsilon_{ij},$$

a random intercept linear growth model. Note that $\epsilon_{ij}$ is often
interpreted as measurement error (e.g. Faucett and Thomas, 1996; Wulfsohn and
Tsiatis, 1997; Xu and Zeger, 2001), although this is problematic since the term
also incorporates the deviation of the true marker from the linear time trend.
The two error components cannot be separated since there is only a single
measurement per patient at each time point.
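
To make the random intercept linear growth model concrete, the following sketch simulates markers from it and compares the empirical correlation of residuals within patients with the residual intraclass correlation $\psi/(\psi + \theta)$. All parameter values, the uniform measurement times and the function names are illustrative assumptions, not quantities taken from the trial.

```python
import numpy as np

def simulate_marker(n_patients=5000, n_occasions=6, beta0=7.0, gamma1=0.1,
                    gamma2=-0.6, psi=2.0, theta=1.8, seed=2):
    """Draw y_ij = beta0 + gamma1*t_ij + gamma2*T_j + eta_j + eps_ij.

    Every default value here is an illustrative assumption.
    Returns the simulated markers and the fixed part of the model.
    """
    rng = np.random.default_rng(seed)
    treat = rng.integers(0, 2, size=n_patients)                    # treatment dummy T_j
    times = rng.uniform(0.0, 5.0, size=(n_patients, n_occasions))  # irregular times t_ij
    eta = rng.normal(0.0, np.sqrt(psi), size=n_patients)           # random intercept
    eps = rng.normal(0.0, np.sqrt(theta), size=(n_patients, n_occasions))
    fixed = beta0 + gamma1 * times + gamma2 * treat[:, None]
    y = fixed + eta[:, None] + eps
    return y, fixed

def residual_icc(psi, theta):
    """Residual intraclass correlation of a random intercept model."""
    return psi / (psi + theta)

y, fixed = simulate_marker()
resid = y - fixed                                   # eta_j + eps_ij
empirical_icc = np.corrcoef(resid[:, 0], resid[:, 1])[0, 1]
theoretical_icc = residual_icc(psi=2.0, theta=1.8)
print(f"theoretical ICC: {theoretical_icc:.3f}, empirical: {empirical_icc:.3f}")
```

Because the random intercept is the only source of dependence within a patient, the correlation between residuals at any two occasions estimates the same intraclass correlation.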

We will also consider a probit model for the dichotomous marker variable
‘normal versus abnormal prothrombin’, using the 70% cut-off for the prothrombin
index. The marker model is similar to that outlined above, with the latent
response $y^*_{ij}$ taking the place of $y_{ij}$. The dichotomous measurements are
generated from the threshold model

$$y_{ij} = \begin{cases} 1 & \text{if } y^*_{ij} > 0 \\ 0 & \text{otherwise.} \end{cases}$$


14.6.3 Hazard model with latent covariate

Let $t_{rj}$ be the $r$th death time survived by patient $j$ and let the hazard at
$t_{rj}$ be denoted $h_{rj}$. A proportional hazards model with the latent marker
(at time $t_{rj}$) as covariate is specified as:

$$\ln h_{rj} = \ln h^0_{rj} + \lambda\eta^{(2)}_{rj} + \alpha_4 T_j,$$

with baseline hazard modeled as

$$\ln h^0_{rj} = \alpha_0 + \alpha_1 t_{rj} + \alpha_2 t_{rj}^2 + \alpha_3 t_{rj}^3.$$

This third degree polynomial was the model chosen from a systematic search
among third degree fractional polynomials (e.g. Royston and Altman, 1994).

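We then interpolate the latent marker to the required death times $t_{rj}$ by
substituting for $\eta^{(2)}_{rj}$ from the structural marker model (14.2),
obtaining the reduced form hazard model

$$\begin{aligned}
\ln h_{rj} &= \ln h^0_{rj} + \lambda(\gamma_1 t_{rj} + \gamma_2 T_j + \eta^{(3)}_j) + \alpha_4 T_j \\
&= \ln \tilde{h}^0_{rj} + [\lambda\gamma_2 + \alpha_4] T_j + \lambda\eta^{(3)}_j,
\end{aligned}$$

where

$$\ln \tilde{h}^0_{rj} = \alpha_0 + [\lambda\gamma_1 + \alpha_1] t_{rj} + \alpha_2 t_{rj}^2 + \alpha_3 t_{rj}^3$$

is the reduced form baseline hazard.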

We see that the total treatment effect $\lambda\gamma_2 + \alpha_4$ on the log
hazard is decomposed into the indirect effect $\lambda\gamma_2$ mediated through
the latent marker and a direct effect $\alpha_4$. If there is no direct effect
($\alpha_4 = 0$), the log hazard is conditionally independent of treatment given
the marker. In this case, survival information conveys no additional information
on the treatment effect if the marker is known, and the marker is a ‘perfect
surrogate’.

A path diagram of the model is shown in Figure 14.7, where the letters $i$, $r$
and $j$ represent indices over which the variables within each frame vary and
$D_{rj}$ is a dummy variable taking the value 1 if patient $j$ dies in risk set $r$
and 0 otherwise (the response for the survival model).



Figure 14.7 Path diagram for structural equation hazards model 

Estimates and standard errors for the models with continuous and dichotomous
markers are shown in Table 14.8. Note that we have fixed $\theta$, which is
not identified in the probit marker model, to the estimate for the continuous
marker model to ease comparison of the estimates and their standard errors.
For the marker model, we note that the estimates for the continuous and
dichotomous case are quite similar. The exception is [Cons] which takes on
different roles in the models, being closely related to a threshold in the
dichotomous case. As expected, the standard errors are higher for the probit, due
to the loss of information incurred from dichotomization. The estimates suggest
that the marker increases over time but is reduced by [Treat]. The residual
intraclass correlations for the continuous observed marker and the underlying
variable for the dichotomous marker are 0.52 and 0.58, respectively.

[Rows of Table 14.8. Marker model: $\beta_0$ [Cons], $\gamma_1$ [Time], $\gamma_2$ [Treat], variance parameters $\theta$ and $\psi$. Hazard model: $\alpha_0$ [Cons], $\alpha_1$ [Time], $\alpha_2$ [Time]$^2$, $\alpha_3$ [Time]$^3$, $\alpha_4$ [Treat], $\lambda$. Log-likelihood.]

Regarding the hazard model, both estimates and standard errors are similar,
whether the observed marker was dichotomized or not. The exception concerns
the negative estimate for [Treat], which is lower in absolute value in the
dichotomous case. The treatment effect on liver functioning (the latent marker)
is estimated as $\hat{\gamma}_2 = -0.64$, with 95% confidence interval
$(-1.00, -0.28)$. Thus, the hormone prednisone has a negative ‘side-effect’ on
liver functioning. The estimated effect of liver functioning on death hazard is
$\hat{\lambda} = -0.38$ with 95% confidence interval $(-0.46, -0.31)$. The
corresponding hazard ratio is $\exp(\hat{\lambda}) = 0.68$. As expected, good
liver functioning reduces the death hazard.

Table 14.8 Estimates for structural equation hazard models, using both continuous
and dichotomous versions of the marker

The direct treatment effect on the death hazard was estimated as
$\hat{\alpha}_4 = -0.18$ (hazard ratio 0.84) with 95% confidence interval
$(-0.42, 0.06)$. Thus, the direct effect of treatment appears to reduce the
hazard a bit, but the estimate is imprecise. The indirect treatment effect is
estimated as $\hat{\lambda}\hat{\gamma}_2 = 0.25$ (hazard ratio 1.28) with 95%
confidence interval $(0.08, 0.41)$. This increased death hazard is due to the
negative side-effect of treatment on liver functioning, which in turn transmits
to an increased hazard. The total treatment effect, the sum of the direct and
indirect effects, is estimated as $\hat{\lambda}\hat{\gamma}_2 + \hat{\alpha}_4 = 0.07$
(hazard ratio 1.07) with 95% confidence interval $(-0.22, 0.35)$. Hence, there
is no evidence for a beneficial treatment effect of prednisone on the death
hazard for cirrhosis patients. Interestingly, prednisone is no longer used as a
treatment of liver cirrhosis.
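
The arithmetic behind the decomposition can be retraced from the point estimates quoted in the text. The sketch below is only a check on the reported numbers; the function name `decompose_effect` is hypothetical. Note that the product of the rounded estimates is about 0.24 rather than the reported 0.25, presumably because the reported value was computed from unrounded estimates, so the reported indirect effect is used when forming the total.

```python
import math

def decompose_effect(direct, indirect):
    """Return total log-hazard effect and the three hazard ratios."""
    total = direct + indirect
    return {
        "total": total,
        "hr_direct": math.exp(direct),
        "hr_indirect": math.exp(indirect),
        "hr_total": math.exp(total),
    }

lam_hat = -0.38      # effect of latent marker on log hazard (from the text)
gamma2_hat = -0.64   # treatment effect on latent marker (from the text)
alpha4_hat = -0.18   # direct treatment effect on log hazard (from the text)
indirect_reported = 0.25                     # indirect effect as reported in the text

indirect_from_rounded = lam_hat * gamma2_hat # product of the rounded estimates
result = decompose_effect(alpha4_hat, indirect_reported)

print(f"exp(lambda)            = {math.exp(lam_hat):.2f}")
print(f"indirect (rounded in)  = {indirect_from_rounded:.4f}")
print(f"total log-hazard effect= {result['total']:.2f}")
print(f"hazard ratios: direct {result['hr_direct']:.2f}, "
      f"indirect {result['hr_indirect']:.2f}, total {result['hr_total']:.2f}")
```

The direct and indirect effects have opposite signs here, which is why the total effect is close to zero even though each component is not.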

Regarding surrogacy, the direct effect $\alpha_4$ does not differ significantly
from 0 at the 5% level, so we might conclude that the marker is a perfect surrogate.
However, this does not appear to make sense since the indirect effect of the
marker is detrimental. Furthermore, even a perfect surrogate is not as
informative as survival times, if our aim is to estimate the size of the treatment
effect on survival, rather than just testing for a treatment effect.


14.7 Summary and further reading 

We discussed covariate measurement error models for both continuous and
discrete true covariates in both prospective and retrospective studies, for the case
of repeated measurements and validation samples. Covariate measurement
error models are discussed in Carroll et al. (1995a) and Gustafson (2004). Roeder
et al. (1996), Schafer (2001), Aitkin and Rocci (2002) and Rabe-Hesketh et
al. (2003a) also consider nonparametric maximum likelihood estimation in
this context.

This chapter illustrates how mathematically very similar models can be used 
for very different problems. For example, the model structure used in the joint 
modeling of the marker and survival process is similar to that discussed for the 
diet and heart disease application in Section 14.2. If the random intercept is 
removed in the marker model, the models in both applications have the same 
structure apart from the different response types of the outcome (durations in 
marker and survival example and dichotomous responses in the diet and heart 
disease example). However, the interpretation of the models is quite different. 

The models for the job training and cervical cancer applications also have
very similar structures. In both examples, a latent class model is used for a
dichotomous covariate (compliance and exposure to herpes virus) that is
perfectly observed in a subsample. In both models, the prevalence of the true
covariate categories and relationship between true covariate and outcome are
assumed to be the same in both subsamples. In the job training example,
further information on latent class membership comes from covariates for
compliance, whereas in the cervical cancer application, further information comes
from imperfect measurements of exposure to herpes virus available in both
subsamples.

A very interesting application of multiple process modeling is given in
Lillard (1993), who considered a simultaneous equations survival model for
marriage duration and fertility. We are currently only at the beginning of
comprehending the scope of latent variable modeling of multiple processes and mixed
responses. However, we expect that there will be exciting future developments
in this area.